Proposal: minimalistic core API for recognizers

Considered


BerndPaysan [160] minimalistic core API for recognizers (Proposal, 2020-09-06 09:40:07)

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned data type id is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- rectype )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] rectype-null  THEN ;

then be told that this is not the right way, even though it looks like it is working.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes a string and returns a rectype+additional data on the stack (no additional data for RECTYPE-NULL):

REC-SOMETYPE ( addr len -- i*x RECTYPE-SOMETYPE | RECTYPE-NULL )

XY.3 Additional usage requirements

XY.3.1 Data type id

rectype: subtype of xt, and executes with the following stack effect:

RECTYPE-SOMETYPE ( i*x state -- j*x )

state is:

  • 0 for interpretation
  • -1 for compilation
  • -2 for POSTPONE

i*x is the additional information provided by the recognizer.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( addr len -- i*x rectype | RECTYPE-NULL ) RECOGNIZER

This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.

RECTYPE-NULL ( state -- ) RECOGNIZER

Performs -13 THROW if the exception wordset is available.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x rectype / rectype-null )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer state @ swap execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: rectype-null ( state -- ) -13 throw ;
: rectype-nt ( nt state -- )
  case
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: rectype-num ( n state -- )
  case
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt rectype-nt / rectype-null )
  forth-wordlist find-name-in dup IF  ['] rectype-nt  ELSE  drop ['] rectype-null  THEN ;
: rec-num ( addr u -- n rectype-num / rectype-null )
  0. 2swap >number 0= IF  2drop ['] rectype-num  ELSE  2drop drop ['] rectype-null  THEN ;

: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
  2>r
  2r@ rec-nt dup ['] rectype-null <> IF  EXIT  THEN  drop
  2r@ rec-num dup ['] rectype-null <> IF  EXIT  THEN  drop
  2r> 2drop ['] rectype-null ;

' minimal-recognizer is forth-recognizer

Testing

JennyBrien

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

I don't think so. It doesn't make much difference in application, because you (almost always?) need to consume the rec-type immediately to use whatever else might be on the stack(s). If you already know what you've got but, for example, can't remember the words to POSTPONE it, you could do something like this with an active RECTYPE:

    -2 RECTYPE-X

But mostly you'll have the RECTYPE sitting passively on the stack as a return for a recognizer, and I don't see a great deal of difference between:

    : postponed  -2 swap execute ; 

and

    : postponed  @ execute ;

Passive rectypes are easier to use (no need to remember when to tick them) and easier to code (no need to check for a bogus mode on the stack).

Compare:

: rectype-nt ( nt state -- )
  case
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;

with:

 : rectype: create , , , ;
 :noname name>interpret execute ;
 :noname name>compile execute ;
 :noname name>compile swap lit, compile, ;  rectype: rectype-nt
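
To make the passive variant concrete, here is a minimal sketch of a table-based rectype for single-cell numbers, with one accessor word per mode. Only rectype: and postponed come from the post above; noop, lit,, interpreted and compiled are names assumed for this illustration, and the cell layout simply follows from the order of the three , stores.

```forth
\ Sketch of a passive rectype: a three-cell table plus accessor words.
\ NOOP, INTERPRETED and COMPILED are assumed names, not part of the proposal.
: noop ( -- ) ;
: lit, ( n -- )  postpone literal ;
: rectype: ( xt-int xt-comp xt-post "name" -- )  create , , , ;
\ The first , stores xt-post, so the table reads: post, comp, int.
: postponed   ( i*x rectype -- j*x )  @ execute ;
: compiled    ( i*x rectype -- j*x )  cell+ @ execute ;
: interpreted ( i*x rectype -- j*x )  2 cells + @ execute ;

' noop  ' lit,  :noname ( n -- ) lit, postpone lit, ;  rectype: rectype-num

5 rectype-num interpreted .   \ interpretation action: the number is left untouched
```

Here the caller picks the mode by choosing the accessor, so the rectype itself never has to check a mode value.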

BerndPaysan

One possible thing is to have an automatic postpone for literals.

: rectype-lit: ( compile-xt "name" -- )
  create ,
  does> @ swap
  case
      0  of  drop  endof
      -1 of  execute  endof
      -2 of  dup >r execute r> compile,  endof
  endcase ;

' lit, rectype-lit: rectype-num
' 2lit, rectype-lit: rectype-dnum
' flit, rectype-lit: rectype-float
' slit, rectype-lit: rectype-string

This works with this method, but not with the previous way.

BerndPaysan

Furthermore, obviously anyone sane who doesn't want to be 100% minimal would instantly define

: rectype: ( xt-int xt-comp xt-post "name" -- )
  create , , , does> swap 2 + cells + @ execute ;

and then define generic rectypes just like in Matthias Trute's version with rectype:

JennyBrien

: rectype-lit: ( xt -- )  ['] noop swap dup >r :noname r@ compile, r> postpone literal postpone compile, postpone ; rectype: ;

not so straightforward, but possible.

ruv

Previous works

In general, I like the approach of active "rectype", i.e. when you can execute it to translate a token — so a "rectype" is a token translator: ( i*x token -- j*x ). I described this approach in comp.lang.forth in 2018 (news:pngvcc$pta$1@gioia.aioe.org).

Bernd should also remember the comparison of version D with the Resolvers API, where I specified this approach and even provided several POCs.

and then define generic rectypes just like in Matthias Trute's version with rectype:

I also showed, just for illustration, a hybrid variant, where a "rectype" can be executed and can also be an argument of the accessors (and it is also compatible with version D, i.e. it is a "passive rectype" as JennyBrien mentioned above).

But the accessors from version D exclude some implementation approaches. Actually these accessors are useless when the higher-level methods are provided. Getting an xt and then executing this xt is an extra step without any benefit in most cases. Let's provide the corresponding methods instead of the accessors.

This works with this method, but not with the previous way.

I'm not sure what you refer to, but "automatic postpone for literals" can be implemented in version D too.

: create-rectype-for-literal ( xt-compiler "name" -- )
  ['] noop swap dup rectype:
;

Token translator

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves

RECTYPE-SOMETYPE ( i*x state -- j*x )

By convention, the name for such a word should start with an English verb.

Concerning passing the state: in my Resolvers API, the state is passed indirectly, i.e. not via the stack. This makes combining translators easier.

E.g.:

: tt-3lit ( 3*x -- 3*x | ) >r tt-2lit  r> tt-lit ;

VS

: tt-3lit-s ( 3*x state -- 3*x | ) dup >r swap >r tt-2lit-s  r> r>  tt-lit-s ;

Passing the state is cumbersome. Also, take into account that it is usually already kept in a variable anyway. Why do you need to pass it via the stack again and again? What is the rationale for passing it directly?

Terminology

Please stop using confusing terminology such as "data type id" (in "The core principle is still that the recognizer is not aware of state, and the returned data type id is"). This terminology is not compatible with the language of the standard. I suggested the proper terminology before and have now published the proposal on forth-standard.org; let's use it (and improve it, if needed), or let's accurately define another terminology. The fact is that all the proposals about recognizers can share the same terminology.

Another example is the term "recognizer types". If a recognizer is a Forth definition having particular behavior, then "recognizer type" is "type of a recognizer", that is, a type of a Forth definition, something like a function type. But actually you mean a "token descriptor", that is, a "descriptor of a token", which tells something about the corresponding token and tells nothing about the recognizers (as Forth definitions).

ruv

Advantages

A huge advantage of this approach (but only when the state is passed indirectly) is that most user-defined token translators can be created far more easily than the corresponding descriptors ("rectypes"). You don't need to cope with three actions, and you don't need to cope with the state at all, since any token translator can be created from other, already defined translators!

BerndPaysan

Yes, I proposed that kind of solution years ago. In effect, both ways have the same expressive power, but one does it by creation of noname words, the other by normal code. Acceptance may differ.

ruv

@JennyBrien wrote

Compare: [...] with:

 : rectype: create , , , ;
 :noname name>interpret execute ;
 :noname name>compile execute ;
 :noname name>compile swap lit, compile, ;  rectype: rectype-nt

(sic: the full postpone action).

This comparison is incorrect, since in the proposed API rectype: (which generates a token translator) can be defined as follows:

: rectype: ( xt-executer xt-compiler xt-postponer "name" -- )
  >r >r >r : ]] case
    0  of  [[ r> xt, ]] endof
    -1 of  [[ r> xt, ]] endof
    -2 of  [[ r> xt, ]] endof
    -22 throw
  endcase [[ postpone ;
;

And you can use your same code to define your rectype-nt or anything else.

BerndPaysan New Version: minimalistic core API for recognizers


Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] rectype-null  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- rectype )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translator | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

SOME-TRANSLATOR ( i*x -- j*x )

A translator depends on STATE to translate the given arguments:

  • 0 for interpretation
  • -1 for compilation
  • -2 for POSTPONE

i*x is the additional information provided by the recognizer.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( addr len -- i*x translator | NOTFOUND-xt ) RECOGNIZER

This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW if the exception wordset is available.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: nt-translator ( nt -- )
  case  state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: num-translator ( n -- )
  case  state @
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] nt-translator  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] num-translator  ELSE  2drop drop ['] notfound  THEN ;
: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

The different actions during interpret/compile/postpone can be factored out easily, and used by a common dispatcher:

: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

Testing

BerndPaysan New Version: minimalistic core API for recognizers


Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translator | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

SOME-TRANSLATOR ( i*x -- j*x )

A translator depends on STATE to translate the given arguments:

  • 0 for interpretation
  • -1 for compilation
  • -2 for POSTPONE

i*x is the additional information provided by the recognizer.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( addr len -- i*x translator | NOTFOUND-xt ) RECOGNIZER

This is a deferred word. It takes a string and tries to recognize it, returning the translator and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW if the exception wordset is available.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: nt-translator ( nt -- )
  case  state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: num-translator ( n -- )
  case  state @
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] nt-translator  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] num-translator  ELSE  2drop drop ['] notfound  THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer
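
As a rough illustration of what "single numbers without base prefix" means in practice, the following sketch exercises rec-num on its own. The definition of rec-num is copied from the reference implementation; the simplified num-translator (interpretation and compilation only) is a stand-in assumed for this example.

```forth
\ Sketch: exercising REC-NUM from the reference implementation on its own.
\ This NUM-TRANSLATOR is a simplified stand-in without the POSTPONE case.
: lit, ( n -- )  postpone literal ;
: notfound ( -- )  -13 throw ;
: num-translator ( n -- n | )  state @ IF  lit,  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] num-translator
  ELSE  2drop drop ['] notfound  THEN ;

s" 123" rec-num  ' num-translator = .  .   \ all digits: number plus its translator
s" 12x" rec-num  ' notfound = .            \ trailing garbage: only NOTFOUND's xt
```

Note that a lexeme such as -5 is also rejected by this recognizer, because >NUMBER does not parse a sign.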

The different actions during interpret/compile/postpone can be factored out easily, and used by a common dispatcher:

: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
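
A possible use of this dispatcher, as a sketch only: the number translator from the reference implementation rebuilt from three separate actions. The names noop, postpone-num, translate-num and [42] are assumptions made for this example, and the code relies on STATE holding -1 while compiling.

```forth
\ Sketch: a number translator assembled with TRANSLATOR:.
\ NOOP, POSTPONE-NUM, TRANSLATE-NUM and [42] are assumed names.
: noop ( -- ) ;
: lit, ( n -- )  postpone literal ;
: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,                     \ stored as: postpone, compile, interpret
  does> ( i*x addr -- j*x )
    state @ 2 + cells + @ execute ; \ STATE -2/-1/0 selects cell 0/1/2
: postpone-num ( n -- )  lit, postpone lit, ;

' noop  ' lit,  ' postpone-num  translator: translate-num

: [42] ( -- )  42 translate-num ; immediate
42 translate-num .          \ interpreting: the number stays on the stack
: answer ( -- n )  [42] ;   \ compiling: 42 is compiled as a literal
answer .
```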

Testing

BerndPaysan

Downside of using STATE right in the dispatcher: POSTPONE becomes more difficult. Instead of

: postpone ( "name" -- ) parse-name forth-recognizer -2 swap execute ; immediate

it is more convoluted

: postpone ( "name" -- )
  parse-name forth-recognizer
  state @ >r -2 state !  catch  r> state !  throw ; immediate

How to detect [[ at the end of a postpone sequence is also not so trivial.

ruv

Downside of using STATE right in the dispatcher: POSTPONE becomes more difficult.

It's OK. Actually, we distribute complexity among various parts. When we make one thing less complex, we make another thing more complex. But because these things occur with different frequency (in systems, libraries, programs), the overall complexity can end up lower or higher.

This approach also makes some things more complex, but the overall complexity decreases, I believe.

Concerning POSTPONE: I think some useful parts should be factored out.

Also, we don't need to catch the exception: usually it's a stop error, and the state is ambiguous in any case. QUIT resets all the internal states. Concerning programs: we need a standard way to reset the internal states of the Forth text interpreter, regardless of the Recognizers proposal.

In my "lexeme resolvers" implementation I use the concept of a postponing level that can be 0, 1, or 2, and introduce words to increment and decrement this level. So POSTPONE is defined as follows:

: postpone  ( " name" --      )   parse-name inc-state translate-lexeme dec-state ( flag ) ?nf ; immediate

Where translate-lexeme is defined as follows:

: perceive-lexeme ( c-addr u -- k*x xt-tt | c-addr u 0 )
  perceptor dup if execute then
;
: translate-lexeme ( i*x c-addr u -- j*x true | c-addr u 0 )
  perceive-lexeme dup if execute true then
;

(Note that, in contrast to this proposal, resolvers return ( c-addr u 0 ) on failure.)

How to detect [[ at the end of a postpone sequence is also not so trivial.

An appropriate approach is that the word ]] is a parsing word.

: ]] ( -- )
  inc-state begin
    next-lexeme 2dup s" [[" equals 0= while
    translate-lexeme ?nf
  repeat 2drop dec-state
; immediate

So we don't have any problem detecting [[ at the end.

An advantage of the postponing level concept is that the following code works as expected:

: foo [  ]] 123 . [[  ]  ;   foo \ prints 123

In the message news:rdcur5$ga4$1@dont-email.me (the full message: news:rdcn35$sd2$1@dont-email.me) I showed another approach, when postponing action is not required at all (i.e., -2 state in this proposal).

ruv

translator: subtype of xt, and executes with the following stack effect:

SOME-TRANSLATOR ( i*x -- j*x )

It's correct in the general case, but it makes little sense, since any definition meets this stack effect.

So I think we should distinguish the parameters of a translator itself from the effect of translating the code that is passed to the translator. Possible variants:

\ We can define 'token' data type
TRANSLATE-SOMETOKEN ( i*x token -- j*x )

\ Some hybrid variant
TRANSLATE-SOMETOKEN  ( i*x token{k*x} -- j*x )

\ Only low level data types
TRANSLATE-SOMETOKEN  ( i*x k*x -- j*x ) 

(NB: I use the conventional naming {verb}-{noun} for such words.)

It should also be noted that these x may be distributed across all the stacks: the data stack, the floating-point stack, the control-flow stack (except the token k*x, which cannot be on the control-flow stack).

BerndPaysan

Indeed, TRANSLATE-SOMETHING sounds better than SOMETHING-TRANSLATOR.

FORTH-RECOGNIZER is ok, because it's followed by EXECUTE, so this is a noun.

ruv

"FORTH-RECOGNIZER" name

I thought about FORTH-RECOGNIZER name. It makes a strong impression that this word is similar to FORTH-WORDLIST ( -- wid ). The problem is that it isn't.

FORTH-WORDLIST is a constant (it always returns the same value) that identifies one and the same word list among all the word lists. This word list can be included in the search order, and it can be absent from the search order.

By analogy, FORTH-RECOGNIZER should be a constant that identifies one and the same recognizer among all the recognizers. This recognizer can be included in the recognizer that is used by the Forth text interpreter, and it can be absent from it. (In accordance with the concept that a sequence of recognizers is also a recognizer.)

All of this would have to hold for the naming to be consistent. But actually it does not. This means that the name breaks consistency and is inappropriate for the proposed word.

FORTH-RECOGNIZER ( -- xt ) can be a word that returns xt of the system's recognizer that is used by the Forth text interpreter by default (i.e. initially).

FORTH-RECOGNIZER is ok, because it's followed by EXECUTE, so this is a noun.

Also, it makes a strong impression that it returns a recognizer. But that's wrong. Also, its result is analyzed much more often than it's followed by EXECUTE.

Basic methods

By no means, we need

  1. a method that tells the Forth text interpreter to use a given recognizer.
  2. a method that returns the recognizer that is currently used by the Forth text interpreter,
  3. a method that performs the recognizer that is currently used by the Forth text interpreter

A single deferred word (a vector) X can solve it:

  1. set: IS X
  2. get: ACTION-OF X
  3. perform: X

But I insist that this approach limits implementations too much. A Forth system may want to perform internal actions when the recognizer used by the Forth text interpreter is switched. But it cannot do so if this recognizer is switched via the IS X method. For that reason, separate getter and setter words are usually provided in the Standard (except for the very ancient BASE and >IN, due to backward compatibility). Yes, perhaps Gforth can attach additional internal actions to the IS X phrase. But we shouldn't complicate all Forth system implementations.

A possible implementation via deferred word and distinct getter and setter words:

defer perceive ( c-addr u -- k*x tt )
: perceptor ( -- xt ) action-of perceive ;
: set-perceptor ( xt -- ) is perceive ;

Perhaps, the more specific names are better (?):

defer perceive-lexeme ( c-addr u -- k*x tt )
: lexeme-perceptor ( -- xt ) action-of perceive-lexeme ;
: set-lexeme-perceptor ( xt -- ) is perceive-lexeme ;
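
To illustrate the kind of internal action a distinct setter makes possible, here is a minimal sketch in which set-perceptor records the previous recognizer so that the switch can be undone. The names perceive, perceptor and set-perceptor are taken from above; max-undo, undo-stack, undo-depth, previous-perceptor and the dummy recognizers are hypothetical and exist only for this example.

```forth
\ Sketch: a setter that performs an internal action (remembering the old xt).
\ MAX-UNDO, UNDO-STACK, UNDO-DEPTH and PREVIOUS-PERCEPTOR are assumed names.
defer perceive ( c-addr u -- k*x tt | c-addr u 0 )
:noname ( c-addr u -- c-addr u 0 )  0 ;  is perceive   \ default: recognize nothing

8 constant max-undo
create undo-stack  max-undo cells allot
variable undo-depth  0 undo-depth !

: perceptor ( -- xt )  action-of perceive ;
: set-perceptor ( xt -- )
  undo-depth @ max-undo = abort" undo stack full"
  perceptor  undo-stack undo-depth @ cells + !   \ remember the current recognizer
  1 undo-depth +!
  is perceive ;
: previous-perceptor ( -- )
  undo-depth @ 0= abort" nothing to undo"
  -1 undo-depth +!
  undo-stack undo-depth @ cells + @  is perceive ;

: rec-a ( c-addr u -- c-addr u 0 )  0 ;   \ dummy recognizers that never match
: rec-b ( c-addr u -- c-addr u 0 )  0 ;
' rec-a set-perceptor
' rec-b set-perceptor
previous-perceptor
perceptor ' rec-a = .     \ true flag: REC-A is active again
```

With a bare IS perceive the system would get no chance to record the old value, which is the limitation described above.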

ruv

Correction: please read "By anyway, we need" instead of "By no means, we need".

BerndPaysan

DEFER is a core word now, so using DEFER for such a thing is ok. We don't need a special getter and setter for everything.

The implication that FORTH-RECOGNIZER returns a recognizer (which it does not; it executes one) is a valid point. A better name is needed. At the moment it is a VALUE and does return a recognizer. Now, it is a deferred word, and does recognize strings. We should keep to Anton's unification: a sequence of recognizers can be combined into one recognizer. Just because it now recognizes more kinds of things, it is still a recognizer. No need to find another synonym. Anything that takes a string and returns data plus a translator token is a recognizer.

Maybe RECOGNIZE-FORTH is the corresponding verb. It takes a string and recognizes it if this is valid FORTH.

ruv

DEFER is a core word now, so using DEFER for such a thing is ok.

Actually, DEFER, as well as TO, is a Core extension word, so it's optional. But it's another argument.

Back to my first argument, what do you suggest if a system needs to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter?

You may ask whether I have an example of such a requirement. Yes, I do. I want to provide a method to undo such switching in my system. It's similar to the effect of the "PREVIOUS" word for the search order. Perhaps you can suggest some solution with the deferred word?

Anton's unification: a sequence of recognizers can be combined to one recognizer.

Yes. I too said that any sequence of recognizers seq-x (from API v4) can be represented as a single recognizer : recognize-x seq-x recognize ;. So sequences are superfluous in the basic API: a Forth system doesn't need to know whether it is a sequence or not.
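
The same idea can be sketched with the conventions of this proposal (failure is signalled by the xt of NOTFOUND) rather than with API v4 sequences. The two toy recognizers and translate-flag below are hypothetical and exist only to show that the combination has the same stack effect as its parts.

```forth
\ Sketch: two recognizers combined into one with the same stack effect.
\ REC-YES, REC-NO, REC-YES/NO and TRANSLATE-FLAG are assumed names.
: notfound ( -- )  -13 throw ;
: translate-flag ( flag -- flag | )  state @ IF  postpone literal  THEN ;

: rec-yes ( c-addr u -- true translate-flag | notfound-xt )
  s" yes" compare 0= IF  true ['] translate-flag  ELSE  ['] notfound  THEN ;
: rec-no ( c-addr u -- false translate-flag | notfound-xt )
  s" no" compare 0= IF  false ['] translate-flag  ELSE  ['] notfound  THEN ;

: rec-yes/no ( c-addr u -- flag translate-flag | notfound-xt )
  2dup 2>r  rec-yes                  \ keep a copy of the lexeme for the second try
  dup ['] notfound = IF
    drop 2r> rec-no                  \ first recognizer failed: try the next one
  ELSE
    2r> 2drop                        \ success: discard the saved lexeme
  THEN ;

s" no" rec-yes/no  ' translate-flag = .  .   \ recognized by the second recognizer
```

A caller cannot tell whether rec-yes/no is a single recognizer or a sequence, which is the point made above.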

Maybe RECOGNIZE-FORTH is the corresponding verb. It takes a string and recognizes it if this is valid FORTH.

It's better. But it recognizes not valid FORTH, but anything that the Forth text interpreter can currently recognize (and only that).

Conceptually, this word isn't just a recognizer. There is a single special system slot for the recognizer that is used by the Forth text interpreter. We can put any recognizer into this slot. We can also perform the recognizer that is placed in this slot. So this word performs the recognizer from this slot. I am inclined to call this slot "perceptor". The word that performs the recognizer from this slot then becomes "perceive".

All recognizer names have the pattern RECOGNIZE-*. The idea is not to put this special word on a par with all other recognizers. For that, it's better to find a name that is distinct from the RECOGNIZE-SOMETHING pattern. What do you think?

ruv

Actually, DEFER, as well as TO, is a Core extension word, so it's optional. But it's another argument.

The argument is that a Forth system can be implemented as a minimal kernel plus additional libraries, and DEFER, IS, ACTION-OF can be provided by a library. But when we put a deferred word into this API, we force a system's author to put DEFER, IS, ACTION-OF into the kernel too, although they aren't actually required there. That would be too restrictive a limitation on implementations.

ruv

Locate

locate cannot work for lexemes that can be recognized (translated) according to this proposal.

ruv

The last comment was intended for the proposal of AndrewHaley, and it was mistakenly placed here.

BerndPaysan

The recognizer will be an option, as well. At the moment, FORTH-RECOGNIZER is proposed to be a value. That's also a CORE EXT word (as is TO).

A minimalistic system that wants to implement recognizers needs FORTH-RECOGNIZER to be a deferred word. I.e. it needs code for DODEFER. It can load the rest of the deferred word stuff later as extension.

ruv

Certainly, recognizers are optional. I didn't mean that some required part requires an optional part. I meant that one optional part requires another complex optional part without any good and fair ground.

Yes, a minimalistic system that wants to provide a deferred word needs only code for DODEFER. But it still makes bootstrapping of this system more complex. Hence, when we put a deferred word into API, we make things more complex for some implementations. But we don't even have a rationale for that.

Also, with deferred word we still don't have a solution if a system needs to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter.

BerndPaysan

CORE has only VARIABLE as option for storing things to change. As a result, the interface to use FORTH-RECOGNIZER has to be clumsy, i.e.

forth-recognizer @ execute execute

Clumsy interfaces cannot be changed later, even when you have better things at hand. You can probably wrap around the clumsy interface, e.g.

Defer recognize-forth
addr recognize-forth Constant forth-recognizer

if you can use ADDR to access the deferred word's xt storage location. But then you have another interface, less clumsy, and only available when you have DEFER+ADDR (and ADDR is not even part of the standard).

A minimalistic API, which is what I am looking for here, is one where you don't have to document much. The less uniform an API is, the more you have to document. The uniformity here is that a recognizer is a word that has ( addr u -- i*x translator-xt ) as stack effect. And combinations of recognizers have the same effect. And the system's recognizer is just another one, which you can swap in and out. And you can define a REC-SEQUENCE, where you can manipulate the sequence, and put that into the system's recognizer.

This uniformity is broken when you don't use a deferred word for the system's recognizer — you can't just call that one as you can call the others. You need @ EXECUTE. This is clumsy.

ruv

CORE has only VARIABLE as option for storing things to change. As a result, the interface to use FORTH-RECOGNIZER has to be clumsy, i.e. forth-recognizer @ execute execute

I don't suggest using a variable in the interface; that's even worse than a defer. When a variable is used to change something, the change cannot be detected effectively. But the requirement is that a system is able to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter.

For that I would prefer to have the separate words in the API: a setter, a getter and a "performer" (a word that performs the recognizer that is currently used by the Forth text interpreter).

What are your objections to having several separate words in the minimalistic API?

The uniformity here is that a recognizer is a word that has ( addr u -- i*x translator-xt ) as stack effect.

I strongly support this approach (and I myself suggested this approach too, with slightly different stack effects).

This uniformity is broken when you don't use a deferred word for the system's recognizer

It seems, the set of words like the following (the names may vary):

perceive ( c-addr u -- k*x tt )
set-perceptor ( xt -- )
perceptor ( -- xt )

doesn't break the mentioned uniformity. Please clarify.
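To illustrate the point about internal actions on switching, here is a minimal sketch of the three-word interface in standard Forth. The storage cell the-perceptor, the counter perceptor-switches (a stand-in for whatever bookkeeping a system might need), and the toy words rec-none and tt-notfound are hypothetical names invented for this example; none of them is part of the proposal.

```forth
\ Sketch only. the-perceptor, perceptor-switches, rec-none and
\ tt-notfound are hypothetical names used for illustration.
variable the-perceptor
variable perceptor-switches   0 perceptor-switches !

: perceptor ( -- xt )  the-perceptor @ ;

: set-perceptor ( xt -- )
  the-perceptor !
  1 perceptor-switches +! ;       \ the system's internal action on switching

: perceive ( c-addr u -- k*x tt )  perceptor execute ;

\ A toy recognizer with the uniform stack effect; it recognizes nothing.
: tt-notfound ( -- )  -13 throw ;
: rec-none ( c-addr u -- tt )  2drop ['] tt-notfound ;

' rec-none set-perceptor
```

In this sketch perceive is called like any other recognizer, so the uniform stack effect is kept, while the setter remains a place where the system can react to the change.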

BerndPaysan

Using special setters and getters means you have another (special purpose) DEFER mechanism here. Of course you can implement that with

variable current-perceptor
: perceive ( addr u -- i*x token ) current-perceptor @ execute ;
: set-perceptor ( xt -- ) current-perceptor ! ;
: perceptor ( -- xt ) current-perceptor @ ;

which is probably a bit less implementation effort than DEFER, IS, and ACTION-OF. Or really?

State-Smart:

: defer  Create ['] noop ,  does> @ execute ;
: is  ' >body state @ if  ]] literal ! [[  else  !  then ; immediate
: action-of  ' >body state @ if  ]] literal @ [[  else  @  then ; immediate

or with NDCS:

: defer  Create ['] noop ,  does> @ execute ;
: is  ' >body ! ; ndcs: ' >body ]] literal ! [[ ;
: action-of  ' >body @ ;  ndcs: ' >body ]] literal @ [[ ;

DEFER is really a lightweight way to define words that can be changed.

These three lines of code are doing more than the three lines of code you need in addition when you have your special-purpose setter and getter, but they are still one-liners.

Forthers like to reinvent the wheel. But don't overdo this.

ruv

Using special setters and getters means you have another (special purpose) DEFER mechanism here.

Not necessarily. It's up to the author/implementer. They can be just wrappers over the standard DEFER, as I showed earlier. So it doesn't mean reinventing the wheel. The implementation details are just hidden.

So the arguments concerning implementation of DEFER mechanism say nothing against three separate words in the minimalistic API.

BTW, having translators for the basic data types, the words is and action-of can be even shorter:

: is  ' >body tt-lit ['] ! tt-xt ; immediate
: action-of  ' >body tt-lit ['] @ tt-xt ; immediate

Well, in any case I would agree that the arguments concerning complexity are more or less weak.

A strong argument (which hasn't been commented on yet) concerns the additional actions that a system needs to perform in the setter. What do you think in this regard?

ruv

One more strong argument against a DEFER word in the API, and in favor of a separate getter and setter, is the following.

With DEFER in the API, we cannot define this API on top of another API at all. But with a separate getter and setter (and "executor"), it is possible to define this API on top of other APIs.

Example: news:rn1csa$b02$1@dont-email.me

BerndPaysan

Gforth's new header structure allows overloading TO, IS (which are essentially the same) and DEFER@, so we can use the DEFER API to access similar changeable execution patterns that are implemented differently. So for us, it makes sense to use these access words, regardless of how the word is implemented.

Other systems may not have this capability, though the way the standard now extends TO for FVALUE and others, you need to have one way or the other to deal with that. Same, when you have an UDEFER in your system for user-specific deferred words.

For me, it is needless clutter of the dictionary and the mental space of the programmer to add setters and getters for things where you already have a generic one. But I see the point that not every system can do this.

ruv

needless clutter of the dictionary and the mental space of the programmer

I have used an approach where a defining word creates two words: a getter and a setter. For example, after the phrase create-prop x the words x and set-x exist. I didn't notice any mental space clutter in this regard. Sometimes I redefined set-x to add additional checks or actions.

Concerning dictionary space — I don't see any problem.

But I see the point that not every system can do this.

True. And even if a system can do this, it's done in some system specific way only.

So, due to the combination of all reasons, it's better to have distinct ordinary words in the standard API.

StephenPelc

If people are interested, I can arrange a virtual meeting for recognisers. They have been workshopped at various Forth Standards meetings but little of substance has emerged so far. I would suggest that such a meeting concentrate on finding what we can agree on.

Note that Forth-200x meetings are public, and the use of real names is strongly encouraged.

ruv

If people are interested, I can arrange a virtual meeting for recognisers. ... concentrate on finding what we can agree on.

I like this idea.

If people are interested, I will prepare before the meeting a proof of concept — an implementation of Recognizer API v4, Nestable Recognizer Sequences, or some other over this API.

Perhaps, somebody could share his list of questions before the meeting. My list at GitHub.

StefanK

A small remark to the POSTPONE test.

We can factor postpone into two parts with state-execute, similar to base-execute:

   : state-execute ( i*x xt s -- j*x )  state @ >r state ! catch r> state ! throw ;
   : POSTPONE ( "name" -- ) parse-name forth-recognizer -2 state-execute ; immediate

That's not very difficult anymore.
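For readers who don't know the base-execute pattern referred to here: Gforth provides a word of that name, and the sketch below is a portable definition along the same lines, using only CATCH and THROW from the Exception word set. The helper parse-ff is a hypothetical word added for the demonstration. state-execute applies the same save, set, catch, restore, rethrow shape to STATE instead of BASE.

```forth
\ Execute xt with BASE temporarily set to u; BASE is restored even if
\ xt throws. Redefining the word is harmless on systems that have it.
: base-execute ( i*x xt u -- j*x )
  base @ >r  base !  catch  r> base !  throw ;

\ Hypothetical helper: convert the string "FF" in the current BASE.
: parse-ff ( -- n )  0. s" FF" >number 2drop drop ;

decimal
' parse-ff 16 base-execute   \ leaves 255; BASE is ten again afterwards
```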

StefanK

IMHO the idea to use a deferred forth-recognize is good and more flexible than a stack of recognizers. But the translator xt makes postpone more difficult. But we can factor postpone into two parts. One that restores the stack contents at runtime similar to lit,, and one that does the compilation. If we use rectype, similar to the proposal of recognizers from 2018, but with lit, as third method, we get an easy postpone and '. Here, we can reuse the compile method directly by the second factor of postpone.

variable state

: translate>interpret @ ;
: translate>compile   cell+ @ ;
: translate>lit,      cell+ cell+ @ ;
\ Well, its translate>*lit, in fact; i.e. regenerate ( i*x ) at runtime.

Defer forth-recognizer ( addr u -- i*x translator / notfound )
Defer perform ( i*x translator -- j*x )

: perform>interpret translate>interpret execute ;
: perform>compile   translate>compile   execute ;

: on  -1 swap ! ;
: off  0 swap ! ;

: [ ['] perform>interpret is perform state off ; IMMEDIATE
: ] ['] perform>compile   is perform state on ;

\ alternatively:
\ :noname is perform state ; dup
\ : [ ['] perform>interpret [ compile, ]  off ; IMMEDIATE
\ : ] ['] perform>compile   [ compile, ]  on ;
\ another alternative:
\ : perform state @ IF translate>compile ELSE translate>interpret THEN execute ;

' [ execute \ initialize state and perform
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer perform
  REPEAT ;

: lit,  ( n -- )  lit lit , ,  ; \ or postpone literal
: throw-13 -13 throw ;

: translator ( xt-*lit, xt-compile  xt-interpret "name" -- )
    create , , , ;

' throw-13 dup dup translator notfound

' lit,
:noname ( nt -- xt-execute | xt-compile, ) dup >cfa swap immediate? IF execute ELSE compile, THEN ;
:noname ( i*x nt -- j*x ) >cfa execute ;
translator translate-nt

' lit,
' lit,
:noname ; \ noop
translator translate-const-cell

: rec-nt ( addr u -- nt translate-nt | notfound )
  forth-wordlist find-name-in dup IF  translate-nt  ELSE  drop notfound  THEN ;
: rec-num ( addr u -- n translate-const-cell | notfound )
  0. 2swap >number 0= IF  2drop translate-const-cell  ELSE  2drop drop notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator | n num-translator | notfound )
  2>r 2r@ rec-nt dup notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

\ simple postpone
: postpone ( "name" -- )
    parse'n'recognize
    dup translate>compile >r translate>lit, execute r> compile,
; IMMEDIATE

\ postpone optimized for immediate words
: postpone ( "name" -- ) \ optimized for immediate words
    parse'n'recognize \ ( i*x translator )
    dup translate-nt = IF ( nt translator )
        over immediate? IF drop >cfa compile, exit THEN
    THEN
    dup translate>compile >r translate>lit, execute r> compile,
; IMMEDIATE

: ' ( "name" -- xt ) parse'n'recognize translate-nt <> IF throw-13 THEN >cfa ; IMMEDIATE

ruv

@StefanK, thank you for your participation. But it looks like you have missed too many arguments discussed above.

For example, a deferred word forth-recognizer has a confusing name, and it cannot be acceptable in the API, since it's difficult for the Forth system to detect when its value is changed (NB: it isn't an argument in favor of "stack of recognizers").


: translator ( xt-*lit, xt-compile  xt-interpret "name" -- )
    create , , , ;
' lit,
:noname ( nt -- xt-execute | xt-compile, ) dup >cfa swap immediate? IF execute ELSE compile, THEN ;
:noname ( i*x nt -- j*x ) >cfa execute ;
translator translate-nt

Also, could you please stick to a consistent and clear terminology?

In this example you create not a token translator, but a named token descriptor (and the corresponding token descriptor object). See Common terminology for recognizers (improvements and criticism are welcome).

   token descriptor object: an implementation dependent data object that describes how to interpret, how to compile and how to postpone (if any) a token.

I also proposed the following naming convention for the corresponding words:

  • For token translators use names in the form tt-*, the abbreviation of translate-token-*; for example, tt-lit, tt-nt.
  • For token descriptors use names in the form td-*, the abbreviation of token-descriptor-*; for example, td-lit, td-nt.

The employed approach in your example to create a token descriptor can be called "three components" approach. A significant disadvantage of this approach is that it doesn't provide a way to reuse old descriptors when you create a new descriptor. Compare to token translators — they can be easily reused to create new token translators. For example, a token translator for a pair ( nt nt ) can be created using the token translator tt-nt for a single nt as:

: tt-2nt ( i*x nt nt -- j*x ) >r tt-nt r> tt-nt ;

To create a token descriptor td-2nt in the three components approach, you need to put in a lot more effort, and you cannot reuse td-nt descriptor.

One possible solution is to not expose the three components approach in the API and instead provide a special method to create a descriptor from other descriptors. For td-2nt it can look as:

  td-nt dup 2 descriptor constant td-2nt
  \ or
  td{ td-nt td-nt }td constant td-2nt

It seems, a user never needs to provide three components for a new descriptor since any new descriptor is always based on some already defined descriptors.

But the approach based on the token translators is far simpler.


By the way, a well-known word to get the xt from an nt is name> ( nt -- xt ) (see Forth-83 / "C. Experimental proposal" / "Definition field address conversion operators").

BerndPaysan

New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion

Problem:

The current recognizer proposal has received a fair amount of criticism. One point is that its API is too big. This proposal therefore tries to define a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as in the original recognizer proposal, so this is not a full proposal; it only sketches some changes to the original one.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Recognized

recognized: subtype of xt, and executes with the following stack effect:

RECOGNIZED-THING ( j*x i*x state -- k*x )

A recognized xt acts on the state passed to it on the stack

  • 0 for interpretation
  • -1 for compilation
  • -2 for POSTPONE

i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x recognized-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the recognized xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

GET-FORTH-RECOGNIZE ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE.

REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

RECOGNIZED: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a recognized word under the name "name", which performs xt-int for state=0, xt-comp for state=-1 and xt-post for state=-2.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: nt-translator ( nt -- )
  case  state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: num-translator ( n -- )
  case  state @
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] nt-translator  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] num-translator  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

Stacks TBD.

Testing

TBD

ruv

A recognized xt acts on the state passed to it on the stack

  1. A proper term for "recognized xt" ("recognized execution token") should be chosen. "recognized xt" means "xt that is recognized", but we don't recognize execution tokens; we recognize lexemes. This xt is just a result of recognizing a lexeme. And it should be named according to what it does, not according to who produces it.

  2. There is no reason to pass state on the stack; we discussed that, and the reference implementation reflects that.

BerndPaysan

The STATE discussion in the 2021 workshop concluded that words or xt executed should not depend on STATE. The reference implementation needs to be adjusted.

For the name of the result values we might want to have another round of bikeshedding. In particular with more native speakers. The current wording represents the last round of bikeshedding.

BerndPaysan

New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion

Problem:

The current recognizer proposal has received a fair amount of criticism. One point is that its API is too big. This proposal therefore tries to define a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as in the original recognizer proposal, so this is not a full proposal; it only sketches some changes to the original one.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Recognized

recognized: subtype of xt, and executes with the following stack effect:

RECOGNIZED-THING ( j*x i*x state -- k*x )

A recognized xt acts on the state passed to it on the stack

  • 0 for interpretation
  • -1 for compilation
  • -2 for POSTPONE

i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x recognized-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the recognized xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

GET-FORTH-RECOGNIZE ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE.

REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

RECOGNIZED: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a recognized word under the name "name", which performs xt-int for state=0, xt-comp for state=-1 and xt-post for state=-2.
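As a usage illustration for RECOGNIZED:, here is a small sketch in which the state is taken from the stack, following the stack effect ( j*x i*x state -- k*x ) given in XY.3.1. The component words num-int, num-comp and num-post and the immediate test words are hypothetical names made up for this example, and the defining word shown is only one possible reading of the glossary entry.

```forth
\ Sketch only; all names except RECOGNIZED: are hypothetical.
: recognized: ( xt-int xt-comp xt-post "name" -- )
  create , , ,                     \ body: xt-post, xt-comp, xt-int
  does> ( i*x state body -- j*x )
    swap 2 + cells + @ execute ;   \ 0 -> xt-int, -1 -> xt-comp, -2 -> xt-post

: lit, ( n -- )  postpone literal ;

: num-int  ( n -- n ) ;                     \ interpreting: leave the number
: num-comp ( n -- )  lit, ;                 \ compiling: compile it as a literal
: num-post ( n -- )  lit, postpone lit, ;   \ postponing: compile code that compiles it

' num-int ' num-comp ' num-post recognized: recognized-num

42 0 recognized-num .                        \ prints 42
: c42 ( -- )  42 -1 recognized-num ; immediate
: get42 ( -- n )  c42 ;                      \ 42 is compiled into get42
: p42 ( -- )  42 -2 recognized-num ; immediate
: compile-42 ( -- )  p42 ; immediate         \ behaves like  42 lit,
: also42 ( -- n )  compile-42 ;              \ 42 is compiled into also42
```

Both get42 and also42 return 42 when executed; the second one gets there through the postponed action.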

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer state @ swap execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: recognized-nt ( nt state -- )
  case
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: recognized-num ( n state -- )
  case
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt recognized-nt / notfound )
  forth-wordlist find-name-in dup IF  ['] recognized-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n recognized-num / notfound )
  0. 2swap >number 0= IF  2drop ['] recognized-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt recognized-nt / n recognized-num / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

Stacks TBD.
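Since the sequence words are still TBD, the following is only a rough sketch of how REC-SEQUENCE: could be implemented in standard Forth, written to match the glossary text (xtn is tried first, then the search proceeds towards xt1). The helper rec-sequence-run and the toy recognizers rec-a and rec-b are hypothetical, NOTFOUND is defined locally so that the example is self-contained, and COMPARE from the String word set is assumed to be available.

```forth
\ Sketch only, not part of the proposal.
: notfound ( -- )  -13 throw ;

\ Try the n recognizers stored at addr in turn; stop at the first success.
: rec-sequence-run ( c-addr u addr n -- i*x xt | notfound-xt )
  cells over + swap ?do              \ I walks over the stored xts
    i @ -rot 2dup 2>r rot            \ keep a copy of the string while trying
    execute                          \ ( i*x xt | notfound-xt )
    dup ['] notfound <> if  2r> 2drop unloop exit  then
    drop 2r>                         \ failure: restore the string, try the next
  1 cells +loop
  2drop ['] notfound ;

: rec-sequence: ( xt1 .. xtn n "name" -- )
  create  dup ,  0 ?do  ,  loop      \ body: n, xtn, .. , xt1
  does> ( c-addr u body -- i*x xt | notfound-xt )
    dup cell+ swap @ rec-sequence-run ;

\ Toy recognizers: each accepts exactly one lexeme.
: tr-a ( x -- x ) ;
: tr-b ( x -- x ) ;
: rec-a ( c-addr u -- 1 xt | notfound-xt )
  s" a" compare 0= if  1 ['] tr-a  else  ['] notfound  then ;
: rec-b ( c-addr u -- 2 xt | notfound-xt )
  s" b" compare 0= if  2 ['] tr-b  else  ['] notfound  then ;

' rec-b ' rec-a 2 rec-sequence: rec-ab    \ rec-a (xtn) is tried first
```

Because the created word has the same stack effect as a single recognizer, its xt can itself be placed into another sequence.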

Testing

TBD

ruv

The STATE discussion in the 2021 workshop concluded that words or xt executed should not depend on STATE.

I see the following in the report by @ulli on 2021-09-08:

Given the two variations to handle STATE (either in RECOGNIZER:'s DOES> part or in INTERPRET), yesterdays participants favoured to have the single occurrence of STATE in INTERPRET. Further investigation and model implementations will show whether on or the other is beneficial.

So it implies further investigation and model implementations.

Could someone provide a rationale in favor of passing state (or better, "mode") via the stack?


My rationale against mode on the stack is the following:

  1. It makes combination of token translators cumbersome. E.g. a definition : tt-3lit ( 3*x -- 3*x | ) >r tt-2lit r> tt-lit ; becomes far more complex.
  2. In most cases a program doesn't need to execute a token translator in a mode that is different from the current mode (counter examples are welcome, except postpone).
  3. The current mode is already held by the system anyway.
  4. (most importantly) It introduces unnecessary coupling between the Forth text interpreter loop and the Recognizer API. This loop does not need to know anything about modes and STATE at all. If we are replacing the system's lexeme translator (along with the system's set of token translators), we should be able to replace it along with the system's STATE (and the set of the system's modes) too. Moreover, a token translator can technically ignore the passed value and use its own set of modes. And even such a simpler mode-beyond-stack API can be implemented over one that passes mode via the stack.
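Point 1 can be made concrete with a small sketch that composes literal translators both ways. All names here are hypothetical; STATE merely stands in for "the mode held by the system", and in the second variant the mode is a flag on the stack (0 for interpretation, non-zero for compilation).

```forth
\ Sketch only; all names are hypothetical.

\ Variant 1: the mode is held by the system.
: tt-lit  ( x -- x | )  state @ if  postpone literal  then ;
: tt-2lit ( x1 x2 -- x1 x2 | )  >r tt-lit  r> tt-lit ;
: tt-3lit ( x1 x2 x3 -- x1 x2 x3 | )  >r tt-2lit  r> tt-lit ;

\ Variant 2: the mode is passed on the stack and has to be threaded
\ through every level of the composition.
: mtt-lit  ( x mode -- x | )  if  postpone literal  then ;
: mtt-2lit ( x1 x2 mode -- x1 x2 | )
  dup >r swap >r  mtt-lit  r> r> mtt-lit ;
: mtt-3lit ( x1 x2 x3 mode -- x1 x2 x3 | )
  dup >r swap >r  mtt-2lit  r> r> mtt-lit ;
```

The second variant needs extra stack juggling at each level only to carry the mode along, which is the cost the argument refers to.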

On the other hand, I don't think that including (mentioning) STATE in a new API is a good choice. STATE returns a read-only address, and it's provided for backward compatibility only. So a better method than STATE is required anyway.

Actually, the system's token translators are the only ones that depend on the system's set of modes. In most cases user-defined token translators are defined via the system's token translators (which should be standardized), and they need to know nothing about the system's set of modes or about STATE at all. At the same time, a user is able to define their own set of recognizers and token translators that don't depend on the system's set of modes, but introduce their own set of modes.

So, the specification for the Recognizer API should mention neither STATE nor a set of magic values like {0, -1, -2}.


Concerning your mode -2: I believe the standard word postpone doesn't need a mode of its own. But in postponing mode, if any, string literals (like s" foo bar") and comments should be treated properly.

ruv

  1. It introduces unnecessary coupling between the Forth text interpreter loop and the Recognizer API.

See a block-based illustration of this idea in my Gist.

ruv

: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

In the reference implementation you keep the mode {0,-1,-2} in the state variable. But that's problematic, regardless of how the mode is passed into token translators (directly via the stack, or indirectly using a dedicated method).

When interpretation state is set by [ (so state is set to 0) and compilation state is then set by ], the mode should be the same as it was before [. If it was -2, it should be set to -2 again, but the information that the mode was -2 has been lost. So another variable should be used to keep a flag saying whether "POSTPONE" mode is active.

Actually, the mode of compilation/interpretation and "POSTPONE" mode are not mutually exclusive. They can be set independently of each other.

For example, the code:

: foo postpone bar [ postpone baz ] ;

can conceptually be defined quite clearly (see my comment). In this fragment, "POSTPONE" mode is active in compilation state for the lexeme bar, and in interpretation state for baz.

So, if "POSTPONE" mode is employed, a different variable for it should be used for this reason too.
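As a minimal sketch of that separation (the names postpone-mode and current-mode are hypothetical and not part of any proposal), the flag can live in its own variable so that [ and ], which only touch STATE, cannot lose it:

```forth
\ Hypothetical flag, kept separately from STATE.
variable postpone-mode
false postpone-mode !

\ Report both modes independently of each other.
: current-mode ( -- postpone? compiling? )
  postpone-mode @  state @ 0<> ;
```

Because [ and ] never write to postpone-mode, switching to interpretation state and back leaves the flag unchanged.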


On the other hand, I'm not convinced that we need "POSTPONE" mode at all. Apart from implementing the word postpone itself, where and how can this mode be used? Even for the questionable construct ]] ... [[, "POSTPONE" mode isn't needed.

BerndPaysan

OK, the most convincing argument is that STATE can go away as a specified thing. You can use and combine system translators, and you can create table-driven translators, but STATE is an implementation detail.

BerndPaysan: New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as that of the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.
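As a rough illustration of how the loop, a recognizer and a translator fit together, here is a minimal sketch that handles only unsigned single-cell numbers in the current base. All demo- names are hypothetical, EVALUATE is used only to hand a string to the loop, and the translator assumes that STATE is non-zero in compilation state.

```forth
: demo-notfound ( -- ) -13 throw ;

\ Translator: leave the number when interpreting, compile it as a literal otherwise.
: demo-translate-num ( n -- n | )
  state @ IF  postpone literal  THEN ;

\ Recognizer: unaware of STATE; returns data plus a translator xt,
\ or only the xt of demo-notfound.
: demo-rec-num ( addr u -- n xt-translator | xt-notfound )
  0. 2swap >number 0= IF  2drop ['] demo-translate-num
  ELSE  2drop drop ['] demo-notfound  THEN ;

defer demo-recognize
' demo-rec-num is demo-recognize

: demo-interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  demo-recognize execute  REPEAT
  2drop ;

\ demo-interpret consumes the rest of the evaluated string.
s" demo-interpret 1 2 3" evaluate  . . .   \ prints 3 2 1
```

The recognizer itself never looks at STATE; only the translator it returns does.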

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

THING-TRANSLATOR ( j*x i*x -- k*x )

A translator xt performs or compiles the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.
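A small sketch of what this means for callers when the exception word set is present (try-notfound is a hypothetical name used only for this illustration):

```forth
: notfound ( -- ) -13 throw ;

\ Execute NOTFOUND under CATCH and return the throw code.
: try-notfound ( -- n ) ['] notfound catch ;

try-notfound .   \ prints -13
```

A recognizer never executes NOTFOUND itself; it only returns its xt, and the interpreter loop triggers the throw when it executes what the recognizer returned.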

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused.

REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state, using a system-specific way to determine the current mode.

TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state.

TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state.

TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component that you can use to construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component that you can use to construct other translators.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish them.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
\ the translator is passed as an xt, so >body is needed to reach its table
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Stacks TBD, copy from Trute proposal.

Testing

TBD

BerndPaysan: New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as that of the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

THING-TRANSLATOR ( j*x i*x -- k*x )

A translator xt performs or compiles the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state, using a system-specific way to determine the current mode.

TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state.

TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state.

TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state.
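To make the table-driven dispatch concrete, here is a small sketch. The definitions follow the reference implementation (with >BODY applied to the translator xt) and assume that STATE holds exactly 0 or -1; translate-demo and the three printing words are hypothetical names used only for illustration.

```forth
\ Assumes STATE holds 0 or -1, as in the reference implementation.
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int  ( translator-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translator-xt -- )  >body cell+ @ execute ;
: translate-post ( translator-xt -- )  >body @ execute ;

\ Three actions that only announce themselves.
: .int  ( -- ) ." int " ;
: .comp ( -- ) ." comp " ;
: .post ( -- ) ." post " ;
' .int ' .comp ' .post translate: translate-demo

translate-demo                    \ prints "int " (interpretation state)
' translate-demo translate-int    \ prints "int "
' translate-demo translate-comp   \ prints "comp "
' translate-demo translate-post   \ prints "post "
```

The three TRANSLATE-... words select an action explicitly, whatever the current state is, while executing the translator itself dispatches on the system's current mode.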

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component that you can use to construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component that you can use to construct other translators.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish them.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , ( empty stack ) CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;
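A quick round trip through the stack library, as a sketch. It spells CELL as 1 CELLS and BOUNDS as OVER + SWAP so that it needs only standard words, and its STACK: stores the initial count with `0 ,`; demo-stack is a hypothetical name.

```forth
: stack: ( size "name" -- )  create 0 , cells allot ;

\ Store the count in the first cell, then item-1 .. item-n in the following cells.
: set-stack ( item-n .. item-1 n stack-id -- )
  2dup !  cell+ swap cells over + swap
  ?do  i !  1 cells +loop ;

\ Walk back from the last stored cell so the items return in their original order.
: get-stack ( stack-id -- item-n .. item-1 n )
  dup @ >r  r@ cells +  r@
  begin  ?dup  while
    1- over @ rot 1 cells - rot
  repeat
  drop r> ;

4 stack: demo-stack
10 20 30 3 demo-stack set-stack
demo-stack get-stack . . . .   \ prints 3 30 20 10
```

On systems that already provide words with these names, the definitions above simply shadow them and may produce redefinition warnings.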

Recognizer sequences

: recognize   ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    \ compare against the xt of NOTFOUND instead of executing it
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  \ the stack body starts n+1 cells (count plus n items) before HERE
  dup stack: dup 1+ cells negate here + set-stack
  DOES>  recognize ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) >body get-stack ;

Testing

TBD

BerndPaysan: New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as that of the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt performs or compiles the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state, using a system-specific way to determine the current mode.

TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state.

TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state.

TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component that you can use to construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component that you can use to construct other translators.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish them.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Defining translators

Once you have TRANSLATE:, and the associated invocation tools, you shall define the translators using it:

: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
' root ' forth ' forth 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysan: New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as that of the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.
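
The loop can be tried out in isolation. The following is a minimal sketch, assuming a system that provides PARSE-NAME and that BASE is decimal. All demo- names are hypothetical and not part of the proposal. The only recognizer here handles unsigned single numbers, and DEMO-RUN relies on EVALUATE so that DEMO-INTERPRET parses the rest of the evaluated string.

```forth
\ Hypothetical demo- names; a sketch, not the reference implementation.
: demo-notfound ( -- ) -13 throw ;

\ State-aware translator: leave n when interpreting, compile it otherwise.
: demo-translate-num ( n -- n | )
  state @ IF  postpone literal  THEN ;

\ State-unaware recognizer for unsigned single numbers.
: demo-rec-num ( addr u -- n translator-xt | notfound-xt )
  0 0 2swap >number 0= IF    \ whole string converted
    2drop ['] demo-translate-num
  ELSE                       \ unconverted characters remain
    2drop drop ['] demo-notfound
  THEN ;

\ The interpreter loop, with DEMO-REC-NUM in place of FORTH-RECOGNIZE.
: demo-interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  demo-rec-num execute  REPEAT
  2drop ;

\ DEMO-INTERPRET consumes the remainder of the evaluated string.
: demo-run ( -- n1 n2 )
  s" demo-interpret 12 34" evaluate ;
```

Executed in interpretation state, demo-run should leave 12 and 34 on the stack. A lexeme such as 7x makes demo-rec-num return the xt of demo-notfound, so the loop would throw -13 when it executes that xt.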

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
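
As an illustration of this stack effect, here is a hedged sketch of a recognizer for character literals of the form 'c'. The names demo-rec-char, demo-translate-num and demo-notfound are hypothetical and chosen so that they do not clash with the words of this proposal.

```forth
\ Hypothetical demo- names; a sketch of one possible recognizer.
: demo-notfound ( -- ) -13 throw ;

\ Translator: leave n when interpreting, compile it as a literal otherwise.
: demo-translate-num ( n -- n | )
  state @ IF  postpone literal  THEN ;

\ Recognize exactly three characters: quote, any character, quote.
: demo-rec-char ( addr u -- c translator-xt | notfound-xt )
  dup 3 = IF
    over c@ [char] ' = IF
      over 2 + c@ [char] ' = IF
        drop char+ c@ ['] demo-translate-num EXIT
      THEN
    THEN
  THEN
  2drop ['] demo-notfound ;
```

The recognizer itself never looks at STATE. It only decides what the lexeme is and hands the data plus a translator xt to the caller, which executes the xt.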

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt performs or compiles the action of the thing, according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
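
The order of evaluation described above can be sketched as follows. This is a simplified, fixed-size illustration and not the reference implementation; demo-sequence: and the other demo- names are hypothetical. The sketch assumes that a failing recognizer leaves nothing but the xt of demo-notfound on the stack.

```forth
\ Hypothetical demo- names; a simplified sketch of a recognizer sequence.
: demo-notfound ( -- ) -13 throw ;
: demo-keep ( x -- x ) ;              \ trivial translator for the demo

: demo-sequence: ( xt1 .. xtn n "name" -- )
  create dup , 0 ?do , loop           \ body: n, xtn, .. , xt1
  does> ( addr u body -- i*x translator-xt | notfound-xt )
    dup cell+ swap @                  \ -- addr u a n
    begin dup while
      over @ -rot 2>r                 \ -- addr u xt      R: a n
      -rot 2dup 2>r rot               \ -- addr u xt      R: a n addr u
      execute dup ['] demo-notfound <> if
        2r> 2drop 2r> 2drop exit      \ success: keep i*x translator-xt
      then
      drop 2r> 2r>                    \ -- addr u a n
      swap cell+ swap 1-              \ advance to the next recognizer
    repeat
    2drop 2drop ['] demo-notfound ;

: demo-rec-1    ( addr u -- 1 xt ) 2drop 1 ['] demo-keep ;
: demo-rec-2    ( addr u -- 2 xt ) 2drop 2 ['] demo-keep ;
: demo-rec-none ( addr u -- notfound-xt ) 2drop ['] demo-notfound ;

' demo-rec-1 ' demo-rec-2 ' demo-rec-none 3 demo-sequence: demo-seq
0 demo-sequence: demo-empty
```

With this sequence, demo-rec-none (xtn) is tried first and fails, then demo-rec-2 succeeds, so demo-rec-1 (xt1) is never reached.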

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.

TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state

TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state

TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component from which you can construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component from which you can construct other translators.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. It takes only interpretation and compilation state into account, and uses the STATE variable to distinguish them.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;
' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Defining translators

Once you have TRANSLATE:, and the associated invocation tools, you shall define the translators using it:

: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
' root ' forth ' forth 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysan New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation

Problem:

The current recognizer proposal has drawn a number of criticisms, one being that its API is too big. This proposal therefore tries to define a very minimalistic API for a core recognizer and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt performs or compiles the action of the thing, according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.

TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state

TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state

TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component from which you can construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component from which you can construct other translators.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. It takes only interpretation and compilation state into account, and uses the STATE variable to distinguish them.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Defining translators

Once you have TRANSLATE:, and the associated invocation tools, you shall define the translators using it:

: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

AntonErtl

It seems to me that, given the reference implementation

' translate-nt translate-int
' translate-num translate-int
' translate-dnum translate-int

does not work (nor with translate-comp nor translate-post). Assuming you solve this, do you really want me to define, e.g.,

:noname ['] translate-nt translate-int ;

to get an xt equivalent to one of the xts that has been passed to translate:?

How do you implement POSTPONE (IIRC Matthias Trute has a reference implementation for that)?

What problem is solved by making all the translators state-smart? The problem I see is that you can only access the individual actions by saving state, setting state, executing the translator, and restoring the state. That's not a good design.

The specification of translate: mentions a "current mode". Where do I find out what a "mode" is? This is non-standard terminology.

BerndPaysan New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately

Problem:

The current recognizer proposal has drawn a number of criticisms, one being that its API is too big. This proposal therefore tries to define a very minimalistic API for a core recognizer and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt interprets, compiles or postpones the action of the thing, according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT

Translate as in interpretation state

COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT

Translate as in compilation state

POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT

Translate as in postpone state
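
A hedged sketch of how the table approach lets a program pick one action deliberately, without touching STATE. The body layout mirrors the reference implementation of TRANSLATE:, but all demo- names are hypothetical, and the DOES> dispatch assumes that STATE holds 0 or -1.

```forth
\ Hypothetical demo- names; layout follows the reference implementation.
: demo-translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,                          \ body: xt-post, xt-comp, xt-int
  does> state @ 2 + cells + @ execute ; \ assumes STATE is 0 or -1

: demo-interpret-translator ( translator-xt -- xt-int )  >body 2 cells + @ ;
: demo-compile-translator   ( translator-xt -- xt-comp ) >body cell+ @ ;
: demo-postpone-translator  ( translator-xt -- xt-post ) >body @ ;

: demo-lit, ( n -- ) postpone literal ;
: demo-keep ( n -- n ) ;

' demo-keep ' demo-lit, :noname demo-lit, postpone demo-lit, ;
demo-translate: demo-translate-num

\ An immediate word that asks for the compile action explicitly.
: demo-[seven] ( -- )
  7 ['] demo-translate-num demo-compile-translator execute ; immediate

: demo-seven ( -- n ) demo-[seven] ;
```

Here demo-[seven] runs while demo-seven is being compiled and appends the literal 7 to it, so demo-seven should return 7. The interpret action can be fetched the same way with demo-interpret-translator.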

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component from which you can construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component from which you can construct other translators.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. It takes only interpretation and compilation state into account, and uses the STATE variable to distinguish them.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;

Defining translators

Once you have TRANSLATE:, and the associated invocation tools, you shall define the translators using it:

: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysan New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows more elaborate features to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name": ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT

Get the interpreter xt from the translator.

COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT

Get the compiler xt from the translator.

POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT

Get the postpone xt from the translator.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component from which other translators can be constructed.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component from which other translators can be constructed.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish between them.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize
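
As an illustration of how a user-defined recognizer might be built on TRANSLATE:, here is a minimal sketch. It repeats the few reference definitions it needs so that it can be loaded on its own. The names rec-kilo and translate-kilo are hypothetical and not part of the proposal, and the dispatch through STATE assumes, like the reference implementation, that STATE is 0 when interpreting and -1 when compiling.

```forth
\ Definitions copied from the reference implementation in this proposal,
\ so that the sketch is self-contained.
: notfound ( -- ) -13 throw ;
: lit, ( n -- ) postpone literal ;
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

\ Hypothetical translator for a single-cell number.
' noop                          \ interpretation: leave n on the stack
' lit,                          \ compilation: compile n as a literal
:noname lit, postpone lit, ;    \ postpone: compile code that compiles n
translate: translate-kilo ( n -- ... )

\ Hypothetical recognizer: "5k" yields 5000, anything else yields NOTFOUND.
: rec-kilo ( addr u -- n translate-kilo-xt | notfound-xt )
  dup 2 < IF  2drop ['] notfound EXIT  THEN
  2dup + 1- c@ [char] k <> IF  2drop ['] notfound EXIT  THEN
  1- 0 0 2swap >number nip      \ ( ud u' ) u' = number of unconverted chars
  IF    2drop ['] notfound
  ELSE  drop 1000 * ['] translate-kilo
  THEN ;
```

In interpretation state with BASE set to decimal, `s" 5k" rec-kilo execute` leaves 5000 on the stack. The recognizer itself never looks at STATE; only the translator it returns does.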

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysan

New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows more elaborate features to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name": ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT

Get the interpreter xt from the translator.

COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT

Get the compiler xt from the translator.

POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT

Get the postpone xt from the translator.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component from which other translators can be constructed.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component from which other translators can be constructed.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish between them.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysan

New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
  • 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows more elaborate features to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name": ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
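
One possible use of the FORTH-RECOGNIZER / SET-FORTH-RECOGNIZE pair is to install a recognizer temporarily and restore the previous one afterwards. The following is only a sketch: the helper name with-recognizer is hypothetical, and the first three lines assume the deferred-word implementation from the reference implementation, which a system is free to replace.

```forth
\ Hooks as in the reference implementation; a system may implement them differently.
defer forth-recognize ( addr u -- i*x translator-xt | notfound-xt )
: set-forth-recognize ( xt -- ) is forth-recognize ;
: forth-recognizer ( -- xt ) action-of forth-recognize ;

\ Hypothetical helper: run xt-body with xt-rec installed, then restore
\ the previous recognizer even if xt-body throws.
: with-recognizer ( i*x xt-rec xt-body -- j*x )
  forth-recognizer >r           \ remember the current recognizer
  swap set-forth-recognize      \ install the temporary one
  catch                         \ run xt-body, trapping any exception
  r> set-forth-recognize        \ restore the previous recognizer
  throw ;                       \ re-raise the exception, if there was one
```

Because the helper only uses the two extension words, it does not depend on FORTH-RECOGNIZE actually being a deferred word.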

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT

Get the interpreter xt from the translator.

COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT

Get the compiler xt from the translator.

POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT

Get the postpone xt from the translator.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component from which other translators can be constructed.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component from which other translators can be constructed.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish between them.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysan

I removed the access words to the xts for a reason: We don't do it that way in Gforth, and we actually found little use of those words.

There are (at least) three ways to create an interface to these operations:

  1. Field-like, i.e. you can read and write the xts for interpretation/compilation/postpone. The typical usage is ( translator ) TRANSLATE-<state> @ EXECUTE.
  2. Valuefield-like, as in the original Trute proposal. You can read the xts for interpretation/compilation/postpone without an extra @, but you can't write them, unless your access word also implements TO. The typical usage is ( translator ) TRANSLATE-<state> EXECUTE.
  3. Deferfield-like, which is what Gforth does. Here, not only the @ is part of the operation, but also the EXECUTE (really a tail-call variant of it). You can neither read nor write the xts, unless the access words also implement IS and ACTION-OF. The typical usage is ( translator ) TRANSLATE-<state>, and that looks about right. Gforth uses different names to not collide with the proposal here.
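
The three styles listed above can be sketched in standard Forth as follows. This is only an illustration: the table layout follows the reference implementation (postpone, compile, interpret), the translator is represented here by the address of its table rather than by an xt, and the names demo-translator, demo-int, interpret-field, interpret-value and interpret-defer are hypothetical.

```forth
\ Hypothetical translator table, laid out like the reference implementation:
\ cell 0 = postpone xt, cell 1 = compile xt, cell 2 = interpret xt.
: demo-int ( -- 42 ) 42 ;
create demo-translator  ' noop ,  ' noop ,  ' demo-int ,

\ 1. Field-like: the access word yields an address; the caller does @ EXECUTE.
: interpret-field ( translator -- addr ) 2 cells + ;

\ 2. Value-like: the access word yields the xt; the caller does EXECUTE.
: interpret-value ( translator -- xt ) 2 cells + @ ;

\ 3. Defer-like: the access word fetches and executes in one step.
: interpret-defer ( i*x translator -- j*x ) 2 cells + @ execute ;

\ All three usages run the same interpretation action:
\   demo-translator interpret-field @ execute
\   demo-translator interpret-value execute
\   demo-translator interpret-defer
```

Only the field-like variant allows the stored xt to be changed with plain `!`; the other two would need TO or IS support in the access word, as described above.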

As an extension, Gforth allows adding more states and thus more access words. That extension also adds IS, TO (which are synonyms) and ACTION-OF to the existing access word (there is only one, because only the postpone state needs an explicit one), and it implements the other two for interpret and compile, which are never used on their own. Of course, when you add a new state, you need to specify what existing translators do in that state, so IS becomes necessary; ACTION-OF comes for free through Gforth's way of implementing TO and its variants, of which ACTION-OF is one. This extension is non-standard and not proposed here; it is used for creating obscured (“tokenized”) source code and for reading name=value-style config files.

The experience so far is that, outside of this extension, only one of those three access words is needed at all: TRANSLATE-POSTPONE. It is needed exclusively inside the standard word POSTPONE itself, whose implementation is left up to the system anyway. So the usage of these words is extremely limited. Therefore, I deleted them and suggest not standardizing them, following the “don't speculate” rule and the aim of this proposal, which is a minimalistic API that contains only what is necessary.

BerndPaysan

New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
  • 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
  • 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows implementing fancier features as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone in this stage, use -48 throw, otherwise define a postpone action:

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.
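The dispatch described for TRANSLATE: can be sketched in standard Forth. This is only an illustration, not the proposed implementation: it follows the table layout of the reference implementation, but reads a hypothetical variable `mode` (0 interpret, -1 compile, -2 postpone) instead of the system's own state, so that it can be tried without touching STATE. The `demo-` actions are made up and just leave a tag on the stack.

```forth
variable mode   \ hypothetical stand-in for the system's mode

: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,   \ body: cell 0 = xt-post, cell 1 = xt-comp, cell 2 = xt-int
  does> ( i*x body -- j*x )
    mode @ 2 + cells + @ execute ;   \ mode 0 -> cell 2, -1 -> cell 1, -2 -> cell 0

\ Made-up actions: keep the data, add a tag that shows which action ran.
: demo-int  ( n -- n 1 ) 1 ;
: demo-comp ( n -- n 2 ) 2 ;
: demo-post ( n -- n 3 ) 3 ;
' demo-int ' demo-comp ' demo-post translate: translate-demo

 0 mode !  42 translate-demo  ( -- 42 1 ) 2drop
-1 mode !  42 translate-demo  ( -- 42 2 ) 2drop
-2 mode !  42 translate-demo  ( -- 42 3 ) 2drop
```

Because `, , ,` stores the top of the stack first, xt-post ends up in the first cell, which is why the index is computed as mode plus 2.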

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

AntonErtl

Making the translators depend on state is a bad idea. It means that everything using the translators becomes infected with this state-dependency. It also means that you cannot implement postpone or ]]...[[ as standard-compliant code (while, with state-independent translators you could).

Moreover, when you write a state-independent text interpreter, such as a polyForth-style text interpreter, or colorforth-bw, you would have to set state before executing the translators, which is perverse. And in the case of colorforth-bw, again there is no standard way to set state to get the translator to perform xt-post.

BerndPaysan

The experience with the usage in Gforth (non-standard extensions excluded) shows that direct calls to translators with a specific state are limited to postpone, which is compile-only and therefore

: postpone ( "name" -- )
  -2 state ! parse-name forth-recognize execute -1 state ! ; immediate compile-only

is not generating surprises (postpone is expected to leave the system in compilation state after it has done its work). In Gforth, ]] and [[ are implemented by changing state, and for recognizing the super-immediate [[ a special recognizer is added to the stack which returns a translator that has a specific postpone effect that changes back to compilation state and drops the additional recognizer from the stack.

' noop dup :noname  ] forth-recognizer stack> drop ; translate: translate-[[

The state-dependent invocation is the 99.9% case for translators, and that includes ]] and [[.

The Forth outer interpreter depends on state (or a similar internal representation). The object that deals with the different actions depending on state is the translator.

The proposal allows you to implement other ways to access the individual methods of a translator, if you need them. It does not encourage anymore to use translators as building blocks for other translators, and we can add wording that only translators created by translate: are standard-conforming. Since there's little use for these other access methods, it does not suggest to standardize those.

BerndPaysan New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
  • 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
  • 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
  • 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows implementing fancier features as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone in this stage, use -48 throw, otherwise define a postpone action:

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a translator.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE-independently, which only works on translators created by TRANSLATE: (e.g. for implementing POSTPONE), so any other way to define a translator is non-standard.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
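The search order stated in this glossary entry (xtn is tried first, xt1 last) can be illustrated with a small self-contained sketch. It is not the reference implementation: the names `try-sequence`, `lexeme`, `rec-a`, `rec-b`, `tr-a`, `tr-b` and `demo-seq` are hypothetical, the sequence is a plain table (a count cell followed by xt1 .. xtn), and the lexeme is parked in a 2VARIABLE for brevity, which a real system would likely avoid.

```forth
: notfound ( -- ) -13 throw ;
2variable lexeme   \ hypothetical: holds the string being recognized

: try-sequence ( addr u seq -- i*x xt | notfound-xt )
  >r lexeme 2! r>                 ( seq )
  dup @ 0 ?do                     ( seq )
    dup dup @ i - cells + @       ( seq xt )   \ i=0 picks xtn, last pass picks xt1
    swap >r                       ( xt ) ( R: seq )
    lexeme 2@ rot execute         ( i*x xt' )
    dup ['] notfound <> if  r> drop unloop exit  then
    drop r>                       ( seq )
  loop
  drop ['] notfound ;

\ Two made-up recognizers with dummy translators.
: tr-a 1 ;  : tr-b 2 ;
: rec-a ( addr u -- xt ) 2drop ['] tr-a ;                            \ accepts anything
: rec-b ( addr u -- xt ) nip 1 = if ['] tr-b else ['] notfound then ; \ one-char strings only
create demo-seq  2 ,  ' rec-a ,  ' rec-b ,    \ xt1 = rec-a, xt2 = rec-b

s" x"  demo-seq try-sequence  ( -- xt-of-tr-b ) drop   \ rec-b is asked first and accepts
s" xy" demo-seq try-sequence  ( -- xt-of-tr-a ) drop   \ rec-b declines, rec-a accepts
```

With this ordering, the recognizer pushed last takes priority, which mirrors how the top of the search order is searched first.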

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

ruv

Anton writes:

Making the translators depend on state is a bad idea.

It's a subject of terminology. A translator depends on state by definition:

to translate a token: to interpret the token if interpreting, or to compile the token if compiling.

token translator: a Forth definition that translates a token; also, depending on context, the execution token for this Forth definition.


If you want a recognizer to return not an execution token, but some opaque identifier, I suggest calling it a "descriptor".

token descriptor object: an implementation dependent data object (a set of information) that describes how to interpret and how to compile a token.

token descriptor: a value that identifies a token descriptor object; also, less formally and depending on context, a Forth definition that just returns this value (i.e., a constant), or a token descriptor object itself.

BerndPaysan

As said, this is just about moving things around. There's little difference if you use translator-execute or execute on a translator as specified way to get from the translator to its state-dependent action. It's all system-dependent and hidden, and systems might implement it without even referring to STATE and only update STATE to reflect compilation and interpretation state and otherwise never look at it, and the way the system internally keeps its state can be completely different. The most obvious difference is that with translator-execute, you need another word.

The fact that some abstract data type is executable does not mean EXECUTE is the only way to operate on it. Recognizer sequences are executable in this proposal, and they still can be read out with and set by GET/SET-RECOGNIZER-SEQUENCE. So you can't just define them as colon definitions, you need to go through RECOGNIZER-SEQUENCE: to define them.

Though I don't propose to standardize this, the proposal also suggests making word list ids executable, and putting them together in a recognizer sequence called search-order. Word list ids still have to be used in other ways, e.g. to add new words to them, and the details are left to the system; but it is clear that they can't be normal colon definitions.

Not providing an abstraction like either translator-execute or execute, and instead putting it directly as state @ abs cells + @ execute into the outer interpreter is a really bad idea, because all details of the reference implementation in which this sequence works become then part of the standard. Other ways to implement it, which may have performance advantages, or not expose the postpone state in STATE would then not be allowed. The following implementation should be standard, too:

: do-translate ( translate-body -- ) 0 + @ execute ;
: state! ( state -- ) dup state ! abs cells ['] do-translate >body cell+ ! ; \ assume threaded code
: translate: ( int-xt comp-xt post-xt "name" -- )
  create swap rot , , ,
  does> do-translate ;
: [ 0 state! ; immediate
: ] -1 state! ;
: ]] 2 cells ['] do-translate >body cell+ ! ; immediate \ STATE left as is

How to recognize [[ is left as exercise to the reader, hint: a recognizer is a good idea, because it actually provides something that is executed at postpone time.

AntonErtl

By comparison, with the first version of this proposal postpone can be implemented like this:

: postpone parse-name forth-recognize -2 swap execute ; immediate

which would not contain non-standard usage like -2 state !, and it would also work in interpret state (not the most important feature, but a feature nonetheless). And ]] could also be implemented as a standard program.

I don't want to restrict the usage of rectypes/translators to state-dependent outer interpreters. Other uses may be rare, but they exist, and people may come up with more over time if we make the interface flexible enough. The proposal does not propose to standardize state-independent ways to get at the functionality. Therefore, if the proposal is accepted, they don't exist for standard programs, and therefore they are not counterarguments against the disadvantages of the proposed state-dependent-only translators. The fact that this state-dependence means that you cannot use rectypes/translators to build other rectypes/translators is another (minor) argument against the state-dependence.

Concerning having a state-independent rectype as an abstract data type, the first version of this proposal proposed that rectype is an executable word with stack effect ( i*x state -- j*x ) where state would be 0, -1, or -2. This does not expose anything about the internals, and even allows defining rectypes without using a special defining word. The invocation in the text interpreter is ( i*x rectype ) state @ swap execute, and in postpone it's as shown above.
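For readers who want to see the calling convention of that first version in isolation, here is a minimal sketch. It is an illustration only: `rectype-demo`, `mode`, `text-interpreter-step` and `postpone-step` are hypothetical names, `mode` stands in for STATE, and the actions merely return a string naming the branch that was taken.

```forth
\ A rectype as an ordinary colon definition with stack effect ( i*x mode -- j*x ).
: rectype-demo ( n mode -- n c-addr u )
  case
     0 of  s" interpret"  endof
    -1 of  s" compile"    endof
    -2 of  s" postpone"   endof
    -24 throw              \ any other mode: invalid numeric argument
  endcase ;

variable mode  0 mode !    \ hypothetical stand-in for STATE

\ The two invocations described in the text:
: text-interpreter-step ( n xt -- n c-addr u )  mode @ swap execute ;
: postpone-step         ( n xt -- n c-addr u )  -2 swap execute ;

42 ' rectype-demo text-interpreter-step  type space .   \ prints: interpret 42
42 ' rectype-demo postpone-step          type space .   \ prints: postpone 42
```

The caller chooses the mode explicitly, so the postpone case needs no change to the global state.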

Alternatively, if the rectype is the address of some data structure, yes, we would need an additional word, maybe rectype-translate ( i*x rectype n -- j*x ) that performs the access to the data structure. The usage in the text interpreter would be ( i*x rectype ) state @ rectype-translate and the usage in postpone would be ( i*x rectype ) -2 rectype-translate.

ruv

Bernd writes:

The most obvious difference is that with translator-execute, you need another word.

Yes, essentially I agree concerning translator-execute and execute alternatives.

Yet another difference is that with translator-execute the Forth text interpreter (the outer loop) should know this additional word (probably it means more degree of coupling). But with execute — it should not know any additional word.

to make word list ids executable [...] but it is clear that they can't be normal colon definitions

Another example is defer-words (words created by defer), which are executable but are not normal colon definitions — defer! and defer@ can be applied to their xt.

The following implementation should be standard, too

The provided implementation for ]] is system dependent, namely it depends on implementation of Recognizers API. But, anyway, Gforth's ]] can be implemented in a standard way via postpone.

BerndPaysan

A translator is the address of a data structure, which also happens to be executable. This is not a contradiction! And there was a proposed standard way to access fields directly, renamed from the Trute proposal (but with otherwise identical, value-field like semantics) to INTERPRET-TRANSLATOR, COMPILE-TRANSLATOR, and POSTPONE-TRANSLATOR. The reason I deleted these is that we don't even use them in Gforth, we only use >POSTPONE, which has a different effect (it does not read out the xts, it executes it right away). If there is consensus that this is the right interface (not a value-field, but a defer-field), I can add this back to the proposal; as well as adding a standard way to set the state without knowing the internals of the system, for which the file recognizer-ext.fs in Gforth also provides a suggestion:

: translate-state ( translator-access-xt -- )
    \ takes a translator access xt, and may check if that actually is one
    >body @ cell/ negate state ! ;

The hypothetical more performant implementation in Reply 1043 would have a different translate-state, which would contain something like

>body @ ['] do-translate >body cell+ !

and only change STATE for interpret/compile.

This proposal is minimalistic on purpose and does not cover all corner cases, especially not those where no consensus has been reached yet.

I consider the magic number dispatch method proposed earlier as not appropriate: this is tied to a specific implementation, and not a good interface. Method invocation or field access should be done by named access words, not by numbers.

ruv

Anton writes:

postpone can be implemented like this

postpone can be implemented in any variant of the Recognizer API, with more or less code.

A difference is whether the behavior of postpone can be extended/changed without redefinition of postpone.

My point: if users need to extend the behavior of postpone without redefinition, then a special method can be specified for that. OTOH, postpone (and ]]) is a poor man's "postponing mode". An example of a more convenient tool is my c-state PoC, which provides a better tool for users, and it even supports any new user-defined special words.

I don't want to restrict the usage of rectypes/translators to state-dependent outer interpreters.

It's not an argument, since the API can provide words like compile-token, execute-token, postpone-token, having ( i*x xt.translator -- j*x ) or ( i*x rectype -- j*x ), which are state-independent and don't restrict usage in the mentioned way.
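One way such state-independent access words could sit next to an executable, state-dependent translator is sketched below. This is an assumption-laden illustration, not part of any proposal text: it presumes the table layout of the reference implementation (xt-post, xt-comp, xt-int in the body of a CREATE'd word), uses a hypothetical `mode` variable instead of STATE, and the `tag-` actions are made up.

```forth
variable mode  0 mode !   \ hypothetical stand-in for the system's mode

: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> ( i*x body -- j*x ) mode @ 2 + cells + @ execute ;

\ State-independent access through the xt of a TRANSLATE:-defined word.
\ >BODY is valid here because the translator was defined via CREATE.
: postpone-token ( i*x xt-translator -- j*x ) >body            @ execute ;
: compile-token  ( i*x xt-translator -- j*x ) >body cell+      @ execute ;
: execute-token  ( i*x xt-translator -- j*x ) >body 2 cells +  @ execute ;

\ Made-up actions that only report which one ran.
: tag-int  ( -- 1 ) 1 ;
: tag-comp ( -- 2 ) 2 ;
: tag-post ( -- 3 ) 3 ;
' tag-int ' tag-comp ' tag-post translate: translate-tag

translate-tag                     ( -- 1 ) drop   \ state-dependent: mode is 0
' translate-tag postpone-token    ( -- 3 ) drop   \ independent of mode
' translate-tag compile-token     ( -- 2 ) drop
```

The access words only work on translators built by this TRANSLATE:, which matches the point made in the thread that a translator defined any other way would not support them.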

BerndPaysan New Version: minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
  • 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
  • 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
  • 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
  • 2023-09-15 Add list of example recognizers and their names.

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows implementing fancier features as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone in this stage, use -48 throw, otherwise define a postpone action:

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
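Because failure is reported as an executable NOTFOUND rather than as a flag, a caller that wants to handle unknown lexemes itself can wrap the EXECUTE in CATCH. The following is a minimal sketch assuming the exception word set is present; demo-notfound, demo-recognize and try-lexeme are hypothetical names, and demo-recognize is a stand-in recognizer that never matches.

```forth
\ Hypothetical demo words, not part of the proposal.
: demo-notfound ( -- ) -13 throw ;          \ same behaviour as NOTFOUND

\ Stand-in recognizer that never recognizes anything.
: demo-recognize ( addr u -- xt-notfound )
  2drop ['] demo-notfound ;

\ Recognize and translate one lexeme, returning the throw code
\ instead of aborting (0 would mean success).
: try-lexeme ( addr u -- ior )
  demo-recognize catch ;
```

With these definitions, try-lexeme returns -13 for any input, which is the "undefined word" throw code.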

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a translator.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

Rationale: By far the most common usage of translators is inside the outer interpreter, and this default mode of operation is invoked by EXECUTE to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE-independently (e.g. for implementing POSTPONE). This only works on translators created by TRANSLATE:, so any other way to define a translator is non-standard.
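As an illustration of the entry above, here is a minimal sketch of a table-based defining word and one translator built with it. It assumes, like the reference implementation, that STATE alone selects between interpretation and compilation, so the postpone action is stored but never dispatched here. All demo- names are hypothetical.

```forth
\ Hypothetical demo names; a sketch, not the proposal's definition.
: demo-translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,                 \ table layout: xt-post, xt-comp, xt-int
  does> ( i*x addr -- j*x )
    state @ 0<> 2 + cells + @ execute ;  \ cell 2 = interpret, cell 1 = compile

: demo-lit, ( n -- ) postpone literal ;

:noname ;                               \ interpret: leave n on the stack
' demo-lit,                             \ compile: compile n as a literal
:noname demo-lit, postpone demo-lit, ;  \ postpone: compile code that compiles n
demo-translate: demo-translate-num ( n -- n | )

\ Run the translator while compiling, via an immediate helper.
: seven, ( -- ) 7 demo-translate-num ; immediate
: get-seven ( -- n ) seven, ;
```

In interpretation state `7 demo-translate-num` simply leaves 7 on the stack; inside get-seven the same translator compiles 7 as a literal.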

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( n*xt n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with the topmost xt on stack and proceeding towards the bottommost xt until successful.

SET-RECOGNIZER-SEQUENCE ( n*xt n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to n*xt.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- n*xt n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as n*xt n.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token.

TRANSLATE-NUM ( n -- n | ) RECOGNIZER EXT

Translates a number.

TRANSLATE-DNUM ( d -- d | ) RECOGNIZER EXT

Translates a double number.

TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT

Translates a floating point number.

TRANSLATE-STRING ( addr u -- addr u | ) RECOGNIZER EXT

Translates a string.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. It takes only interpretation and compilation state into account and uses the STATE variable to distinguish between them.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT 2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Recognizer examples

REC-NT ( addr u -- nt translate-nt | notfound ) Search the locals wordlist if locals have been defined, and then the search order for a definition matching the string addr u, and provide that name token as result.

REC-NUM ( addr u -- n translate-num | d translate-dnum | notfound ) Try converting addr u into a number, and on success return either a single number n and translate-num, or a double number d and translate-dnum.

REC-FLOAT ( addr u -- r translate-float | notfound ) Try converting addr u into a floating point number, and on success return that number r and translate-float.

REC-STRING ( addr u "string"<"> -- addrs us translate-string | notfound "string"<"> ) Convert quoted strings (i.e. addr u starts with '"') in the input stream into string literals, performing the same escape handling as S\" and on success return the converted string as addrs us and translate-string.

REC-TICK ( addr u -- xt translate-num | notfound ) If addr u starts with a ` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.
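A backtick recognizer along these lines can be sketched in a few lines. This is only an illustration under simplifying assumptions: it searches FORTH-WORDLIST with the standard SEARCH-WORDLIST instead of walking the whole search order, and it uses hypothetical demo- names with a STATE-dispatching translator table like the one in the reference implementation.

```forth
\ Hypothetical demo names; simplified to search FORTH-WORDLIST only.
: demo-notfound ( -- ) -13 throw ;
: demo-translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,  does> state @ 0<> 2 + cells + @ execute ;
: demo-lit, ( x -- ) postpone literal ;
:noname ;  ' demo-lit,  :noname demo-lit, postpone demo-lit, ;
demo-translate: demo-translate-num

: demo-rec-tick ( addr u -- xt xt-translator | xt-notfound )
  dup 2 u< if 2drop ['] demo-notfound exit then         \ need ` plus a name
  over c@ [char] ` <> if 2drop ['] demo-notfound exit then
  1 /string forth-wordlist search-wordlist              \ 0 | xt 1 | xt -1
  if ['] demo-translate-num else ['] demo-notfound then ;

: demo-word ( -- n ) 42 ;   \ a word to look up in the usage example
```

Recognizing the lexeme `` `demo-word `` then yields the xt of demo-word together with demo-translate-num, so the xt is treated like a number: left on the stack when interpreting, compiled as a literal when compiling.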

REC-SCOPE ( addr u -- nt translate-nt | notfound ) Search for words in specified vocabularies (the vocabulary needs to be found in the current search order). The string addr u has the form vocabulary:name. Apart from specifying the vocabulary to be searched, REC-SCOPE is identical in effect to REC-NT.

REC-TO ( addr u -- xt n translate-to | notfound ) Handle the following syntax of TO-like operations of value-like words:

  • ->value as TO value or IS value
  • +>value as +TO value
  • '>value as ADDR value
  • @>value as ACTION-OF value

xt is the execution token of the value found, n indexes which variant of a TO-like operation is meant, and translate-to is the corresponding translator.
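The prefix analysis that such a recognizer needs can be sketched separately from the lookup of the value. The word below is hypothetical, and the numbering of the variants (0 for ->, 1 for +>, 2 for '>, 3 for @>) is an assumption made for this example; the text above only says that n indexes the variant.

```forth
\ Hypothetical helper: split a TO-like prefix from a lexeme.
\ Variant numbering 0..3 is an assumption for illustration.
: demo-to-prefix ( addr u -- addr' u' n true | addr u false )
  dup 3 u< if false exit then                 \ two prefix chars plus a name
  over 1+ c@ [char] > <> if false exit then   \ second char must be >
  over c@ case
    [char] - of 0 endof                       \ ->value
    [char] + of 1 endof                       \ +>value
    [char] ' of 2 endof                       \ '>value
    [char] @ of 3 endof                       \ @>value
    drop false exit                           \ unknown first char: no match
  endcase
  >r 2 /string r> true ;                      \ strip the prefix, keep n
```

A full recognizer would look up the remaining name, check that it is a value-like word, and return its xt, n and the translator.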

REC-ENV ( addr u -- addrs us translate-env | notfound ) Takes a pattern in the form of ${name} and provides the name as addrs us on the stack. The corresponding translator translate-env is responsible for looking up that name in the operating system's environment variable array.
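The pattern check for this recognizer is plain string handling and can be sketched as follows. The names are hypothetical, and demo-translate-env is only a placeholder that throws -21 (unsupported operation), since looking up environment variables is system-specific.

```forth
\ Hypothetical demo names; the translator is a placeholder only.
: demo-notfound ( -- ) -13 throw ;
: demo-translate-env ( addrs us -- ) -21 throw ;   \ lookup not sketched here

: demo-rec-env ( addr u -- addrs us xt-translator | xt-notfound )
  dup 4 u< if 2drop ['] demo-notfound exit then            \ shortest form is ${x}
  over c@ [char] $ <> if 2drop ['] demo-notfound exit then
  over 1+ c@ [char] { <> if 2drop ['] demo-notfound exit then
  2dup + 1- c@ [char] } <> if 2drop ['] demo-notfound exit then
  2 /string 1-                                             \ strip ${ and }
  ['] demo-translate-env ;
```

For the lexeme ${HOME} this returns the string HOME and the translator; anything that does not match the pattern falls through to the next recognizer.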

REC-COMPLEX ( addr u -- rr ri translate-complex | notfound ) Converts a pair of floating point numbers in the form of float1+float2i into a complex number on the stack, and returns translate-complex on success.

Testing

TBD

BerndPaysan

Things to discuss, because there are still too many variables.

ToDo:

  • Rename Recognizers from REC-result to RECOGNIZE-result. A solution for .RECOGNIZERS drowning the reader in recognize- could be to skip that prefix, because all recognizers are supposed to have the same prefix, anyways.
  • Revert the name of translators to rectypes or some similar word showing that this does describe a type?
  • Add mode/state-specific access words to the translators again and decide on how they work. I prefer defer-field likes, which right away execute the corresponding action, and not put an xt on the stack for consumption. Defer-fields could work together with IS and ACTION-OF to access the xts within (in Gforth, they do).

Answers to some questions:

A lot of thought went into making different subsets of this proposal useful on their own, and into allowing different implementation strategies. The answer to “can I do without feature X” is most likely yes. You can use the subset of the features you want. Stripping away too much, however, results in a subset that is no longer usable.

  • Opening up the whole idea to small systems is useful to gain wider use.
  • FORTH-RECOGNIZE is a deferred word in the reference implementation on purpose; that allows changing it without adding more words. If you don't want to implement it as a deferred word, you can provide the (optional) setter and getter words instead to swap named sequences in and out.
  • The recognizer sequences do have words to get and set the sequence, so you can just work with a single sequence and set/get it if you like. The nesting capability comes by the magical fact that a recognizer sequence has the same stack effect as a recognizer.
  • You can do without both, because recognizer sequences can be written as colon definitions “by hand”.
  • Named sequences are useful, especially when you swap in recognizer sequences for applications that do something completely different from the Forth recognizer sequence. If you do not want to support named sequences, you can still provide the one single named sequence FORTH-RECOGNIZE, and allow SET-RECOGNIZER-SEQUENCE and GET-RECOGNIZER-SEQUENCE to operate just on that. That is also an option in which recognizers are useful without FORTH-RECOGNIZE being deferred and without RECOGNIZER-SEQUENCE:.
  • The NOTFOUND return for failure is there so that you can always EXECUTE the result of FORTH-RECOGNIZE and don't have to check for errors there.

Tough question: The string recognizer has a side effect, which is not good. Moving that side effect to the translator causes other problems, because TRANSLATE-STRING then no longer has the corresponding string on the stack, but needs to parse it later. Actually, parsing should happen in PARSE-NAME. It still seems to be a hack that doesn't have a perfect solution.

ruv

Rename Recognizers from REC-result to RECOGNIZE-result

In general, an abbreviation or acronym may be acceptable to me. But in this case I prefer RECOGNIZE- rather than REC-. The main disadvantage of rec is that it has misleading associations. And the main advantage of recognize is that it's a whole English word that is very appropriate for our case.

The part referred to as "result" should not be a result (of recognizing), but the expected type of the input lexeme. Have a look at your examples: REC-NUM and REC-TICK produce the same result type translate-num, but they accept different types of input lexemes, and these types are identified by the NUM and TICK symbols respectively.

Thus, the naming form for recognizers can be expressed as RECOGNIZE-{lexeme-type-symbol}.


Revert the name of translators to rectypes or some similar word showing that this does describe a type?

A type of what? It describes the type of a token i*x, which is a result of recognizing. More precisely, a token translator identifies the type of the token i*x that results from recognizing, so a token translator is a token type at the same time.

If we want to reflect this idea, we can use the acronym tt, which stands for both: token translator and token type. Then, token translators can be named according to the form TT-{token-type-symbol}. It looks elegant to me.

The names of translators are used for two purposes: to call a translator (for example, when we define a new translator via existing translators), and to obtain the xt of a translator (which is at the same time an identifier for a token type) in order to analyze a result of recognizing. The prefix tt- looks good in both cases.

ruv

An example of using translators for two different purposes:

\ use "tt-lit" and "tt-2lit" just to call these token translators:

: tt-3lit ( 3*x -- 3*x | )
  >r tt-2lit  r> tt-lit
;

: recognize-forth-lexeme ( sd -- i*x tt ) forth-recognizer execute ;


\ use "tt-xt" to analyze a token type:

: recognize-tick ( sd -- xt tt.xt | 0 )
  "'" match-head 0= if 2drop 0 exit then  ( sd2 ) \ the input lexeme without the leading tick
  ['] recognize-forth-lexeme execute-balance2 ( i*x tt|0 n.data-stack n.float-stack )
  2>r dup ['] tt-xt = if 2rdrop exit then drop 2r> fndrop ndrop 0
;

In this implementation for recognize-tick (not tested), the phrase 'foo::bar::baz will work correctly and return the xt of the word baz in the wordlist bar in the wordlist foo, when recognize-pqname for the syntax "::" (example) is a part of forth-recognizer.

To implement this, we make a nested call of the Forth recognizer for another lexeme and then analyze the returned type. If the returned type is not appropriate, we drop the token (from the data stack, and from the floating-point stack, if any). So we need to be sure that calling recognize-forth-lexeme never causes any side effect (other than on the stacks), even when recognizing succeeds.

NB: when recognize-tick is a part of the current forth-recognizer, executing recognize-forth-lexeme on some inputs will produce an indirectly recursive call of recognize-forth-lexeme (as intended).

ruv

Tough question: The string recognizer has a side effect, which is not good. Moving that side effect to the translator is causing other problems, because TRANSLATE-STRING no longer has the corresponding string on the stack, but needs parsing it later.

  1. It's perfectly allowed for a translator to parse the input buffer and/or read the input stream. Some token translators will even make nested calls of the Forth text interpreter and can throw exceptions.

  2. The problem that a part of the string can be in the input buffer (or even in the input stream) is solved by introducing two translators for strings: one accepts the full string from the stack (e.g. tt-slit), and another (e.g. tt-slit-parsing) accepts the starting part from the stack and the tail from the input buffer (or input stream). The string recognizer returns one or the other depending on whether a lexeme is a completed string or only the start of the string.

I published a reference implementation in 2019, and now updated it for the current proposal.

A string recognizer can be as follows:

: quot ( -- sd.quot ) s\" \"" ;

: recognize-string ( sd.lexeme -- sd tt.slit|tt.slit-parsing | 0 )
  quot match-head 0= if 2drop 0 exit then quot match-tail if ['] tt-slit exit then
  2dup quot contains if 2drop 0 exit then \ fail if '"' is found in the middle of the string
  ['] tt-slit-parsing
;
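The helpers match-head and match-tail are not defined in this thread. The following is only a sketch of what they might look like, with stack effects inferred from how recognize-string uses them; the actual definitions in the reference implementation may differ. It uses COMPARE and /STRING from the String word set.

```forth
\ Assumed helper: strip the prefix ( a2 u2 ) from the string ( a1 u1 ) if present.
: match-head ( a1 u1 a2 u2 -- a3 u3 true | a1 u1 false )
  2 pick over u< if  2drop false exit  then   \ string is shorter than the prefix
  dup >r  3 pick r@ compare                   ( a1 u1 n ) ( R: u2 )
  if  r> drop false exit  then                \ leading characters differ
  r> /string true
;

\ Assumed helper: strip the suffix ( a2 u2 ) from the string ( a1 u1 ) if present.
: match-tail ( a1 u1 a2 u2 -- a1 u3 true | a1 u1 false )
  2 pick over u< if  2drop false exit  then   \ string is shorter than the suffix
  dup >r  2over + r@ - r@ compare             ( a1 u1 n ) ( R: u2 )
  if  r> drop false exit  then                \ trailing characters differ
  r> - true
;
```

Both words leave the original string untouched on failure, which is what allows recognize-string to chain the checks.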

BerndPaysan

The code which I have simply looks like this:

['] translate-string of  json-string!           endof

Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.

Thinking a bit more about that, I found:

  1. The thing you want to nest is the translator for names
  2. As I said, names should be first, numbers second and the rest third
  3. We have nestable recognizer sequences

So one solution would be to put all recognizers that return nts+translate-nt (or variants of that, e.g. locals have a variant of translate-nt that differs for postpone) in one recognizer stack, which has a name and can be called without calling the entire recognizer stack. These recognizers now have a predictable effect and no side effect. Since you can't tick locals, you still have to check for translate-nt, but that's ok. You don't have to go through all the weird other recognizers.

In Gforth, .recognizers now can handle and display nested recognizers, and if you split this up like that, it would output:

.recognizers  ~names ( ~nt ( Forth Forth Root ) ~scope ) ~numbers ( ~num ~float ) ~others ( ~string ~to ~dtick ~tick ~body ~complex ~env ~meta )

The ~ is there to abbreviate recognize- (or rec- now).

This also makes it easier to add recognizers where they belong, e.g. when you add the scope recognizer, you just push it to the end of the names recognizer stack. If you add the floating point recognizer, the complex recognizer (both are numbers), or the hex floating point recognizer for exact notation of floating point constants, you just push them to the back of the numbers recognizer stack, and they get ahead of the others. I like this solution.

The other solution is what Gforth does: There's a ?REC-NT which does the nesting, the checking for translate-nt, and the cleaning up of the side effects (stacks and >IN). This could be made more generic, e.g. by creating a word TRY-RECOGNIZE which takes an xt and applies it to the recognition result; if that xt returns false, everything is cleaned up and false is returned, otherwise whatever that xt left (including the flag) is returned.

The cleaning up is already cumbersome, because a variable number of values can be returned on both data and floating point stack, and when in addition to that also >IN can change, it's just a little bit more hassle.

ruv

One correction. I wrote:

If we want to reflect this idea, we can use the acronym tt, which stands for both: token translator and token type.

It should be read as:

If we want to reflect this idea, we can use the acronym tt, which stands for both: "translate token" (verb) and "token type" (noun).

Data type symbol

To specify formal requirements, we have to introduce a new data type for token translators, which is a subtype of xt. And the abbreviation tt is a good candidate for this data type symbol.

If we have the data type tt => xt|0, and the symbol sd for the string data type, the naming convention along with the stack diagram for a recognizer can be expressed as:

RECOGNIZE-{lexeme-type-symbol} ( sd.lexeme -- i*x tt ) ( F: -- j*r )

ruv

@BerndPaysan writes:

Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.

This would be a valid argument if it were possible to return something useful from a recognizer in all use cases but single-line string literals. But it's impossible.

For example, a recognizer for multi-line string literals cannot parse the full string literal without refilling the source (see my PoC implementation). Should we also restore the input source state to isolate side effects of recognizers?

And it still isn't enough. A recognizer for curly-based markup like foo{ any forth code bar{ nested code }bar ... }foo cannot return something useful, since a useful thing in this case is a created definition or just a side effect of appending some semantics to the current definition. Should we also restore the state of the dictionary?

I think, it's obvious — isolation of all possible side effects of recognizers is not fruitful.

Yes, some recognizers return objects that are not useful by themselves, but they still return information about what a given lexeme means, and that is an acceptable price for the absence of side effects in all recognizers.

We also separate concerns into things that do have side effects (token translators) and things that don't have side effects (recognizers). It's a very useful separation.

ruv

@BerndPaysan writes:

The code which I have simply looks like this:

['] translate-string of  json-string!           endof

A straightforward solution is to handle each token type of string literals separately. Probably, I would write it as follows:

  'tt-slit           of                  json-string!    endof
  'tt-slit-parsing   of  parse-slit-end  json-string!    endof
  'tt-slit-ml        of  parse-slit-ml   json-string!    endof

(I would use a recognizer for a leading tick, and naming of translators in the form tt-{token-type-symbol})

Or I would factor a helper word as follows:

: ?prepare-tt-slit ( i*x tt -- i*x tt | sd.transient tt.slit )
  case
    'tt-slit           of                  'tt-slit endof
    'tt-slit-parsing   of  parse-slit-end  'tt-slit endof
    'tt-slit-ml        of  parse-slit-ml   'tt-slit endof
    dup  \ default: keep tt, since endcase drops the selector
  endcase
;

: eval-json ( .. tag -- )
  ?prepare-tt-slit case
    ...
    'tt-slit           of                  json-string!    endof
    ...
  endcase
;

ruv

Multiple entry points for the Forth recognizer

@BerndPaysan writes:

This also makes it easier to add recognizers where they belong, e.g. when you add the scope recognizer, you just push it to the end of the names recognizer stack. If you add the floating point recognizer, the complex recognizer (both are numbers), or the hex floating point recognizer for exact notation of floating point constants, you just push them to the back of the numbers recognizer stack, and they get ahead of the others. I like this solution.

Yes, I have also considered such a solution. It is a convenient way to implement the default Forth recognizer.

But requiring the Forth recognizer to always conform to this particular structure of recognizer sequences, and even to always be the same instance of this structure, is too restrictive.

Otherwise, you don't know the id of the actual names recognizer sequence (or even whether such a sequence exists), and so you cannot check a lexeme against only this sequence (I mean, in the implementation of recognize-tick).

Filter recognizer results

Bernd, your word try-recognize is a good factor for filtering results, regardless of side effects (beyond stacks). If recognizers have no side effects, it can also be implemented in a portable way.

If this word filters for a single token type, it's better to pass the corresponding tt directly (instead of xt.filter).

If this word allows filtering for multiple token types (I assume this variant), it should not drop tt from the stack.

Also, to be more useful, this word should not be bound to the current Forth recognizer only. Then, this word can be named and specified as

apply-recognizer-filter ( sd.lexeme xt.recognizer xt.filter -- i*x tt | 0 )
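A possible portable definition might look as follows. This is only a sketch under two assumptions that the thread does not settle: recognizers have no side effects other than on the data and floating-point stacks, and xt.filter has the stack effect ( tt -- tt flag ). It needs FDEPTH and FDROP from the Floating-Point word set.

```forth
: apply-recognizer-filter ( c-addr u xt.recognizer xt.filter -- i*x tt | 0 )
  depth 4 - >r  fdepth >r  >r          ( c-addr u xt.rec ) ( R: n.depth n.fdepth xt.filter )
  execute                              ( i*x tt | 0 )
  dup 0= if  r> drop r> drop r> drop  exit  then   \ not recognized: return 0
  r> execute                           ( i*x tt flag ) ( R: n.depth n.fdepth )
  if  r> drop r> drop  exit  then      \ accepted by the filter: return the token as is
  \ rejected by the filter: discard the token from both stacks
  begin  fdepth r@ >  while  fdrop  repeat  r> drop
  begin  depth  r@ >  while  drop   repeat  r> drop
  0
;
```

The stack depths are saved before the recognizer runs, so a token of any size can be discarded without knowing its layout.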

A usage example:

: recognize-forth-name ( sd.lexeme -- nt tt.nt | 0 )
  forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter
;

: find-forth-name ( sd.lexeme -- nt | 0 )
  forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter  if exit then 0
;

: find-forth-name? ( sd.lexeme -- nt true | false )
  forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter  0<>
;

: recognize-tick ( sd.lexeme -- xt tt.xt | 0 )
  "'" match-head 0= if 2drop 0 exit then  ( sd2 ) \ the input lexeme without the leading tick
  forth-recognizer [: dup 'tt-xt = ;] apply-recognizer-filter
;

The cleaning up is already cumbersome, because a variable number of values can be returned on both data and floating point stack, and when in addition to that also >IN can change, it's just a little bit more hassle.

Yes, but, as I have shown, >in is not enough. Also, it is better to avoid such special cases in general.

ruv

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

There is no way for a program to check whether it can apply TO to FORTH-RECOGNIZER, or FORTH-RECOGNIZE, or RECOGNIZE-FORTH-LEXEME, etc. Thus, TO cannot be optional. And it cannot be mandatory either. Thus, TO cannot be a part of the API at all, neither in RECOGNIZER nor in RECOGNIZER EXT.

Then the getter and setter should be a mandatory part of the API.

ruv

Continuing from the earlier message:

Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.

This would be a valid argument if it were possible to return something useful from a recognizer in all use cases

Another example of a token that is not useful by itself is the result of the mentioned recognizer REC-TO, which recognizes a syntax like ->foo.

It's too restrictive to require this token to be ( xt n tt ), since in some systems it can be just ( xt.set-value tt ), and in others ( addr.data-field xt.store tt ). This means that this token is not something useful to a program at all (apart from translation).

GeraldWodni

The committee thanks the authors for all the work. Here is the timetable:

  • Everybody interested in this proposal: please submit your comments by end of October.
  • Bernd (main author): please work this into a new version by the end of the year (2024).
  • The committee will have a special interim meeting for this very proposal in February (final date will be announced in mattermost)
Considered

BerndPaysan

Concerning the setters and getters: I would prefer to make it mandatory that FORTH-RECOGNIZE actually is a deferred word, and drop the additional getters and setters completely. DEFER, IS, and ACTION-OF are all CORE EXT; so if you implement the recognizers, you have a dependency on those. The previous proposals had VALUE and TO as the interface, which are also CORE EXT.

Gforth could support IS and ACTION-OF on recognizer sequences, too (i.e. assign n elements in order), through its polymorphous approach at all those words for value-style words (TO, +TO, ADDR, IS, ACTION-OF all can do different things on different classes of values), but I guess that would be too much.

Can those setters and getters be optional in case you don't want to support DEFER, and how can a program be written to work in both cases? If you have TOOLS EXT available, you can use

[DEFINED] is [IF]
    is forth-recognize
[ELSE]
    [DEFINED] to [DEFINED] forth-recognizer and [IF]
        to forth-recognizer
    [ELSE]
        set-forth-recognizer
    [THEN]
[THEN]

Yes, this is ugly and shows that having different options is not a good idea.

For the reworked proposal, I will need to restructure the proposal in a way that optional parts I rather want to remove are outlined as such, so that the final rewrite is easy.

ruv

Deferred words in API considered harmful

make it mandatory that FORTH-RECOGNIZE actually is a deferred word

As we have discussed, the main problem with a deferred word is that it can't be redefined by wrappers that have additional actions when setting or getting the value. In this respect, such a word in an API is as bad as an address-flavoured variable (like BASE).

There is also a recent discussion in comp.lang.forth (link) under subjects "value-flavoured approach" and "value-flavoured structures".

Special data object on failure considered harmful

A question is what to return on failure (unsuccess): a special data object (xt of notfound) or a common data object 0 (zero).

Below is a copy of my rationale from 2023, with some rewording.

There are two strong arguments against a special data object:

  • consistency with other similar words;
  • impact on the overall lexical size of programs.

Consistency

Many standard words return some data object on success, or 0 (zero) on unsuccess/failure. This is possible because this data object cannot be 0.

For example:

  • name>interpret ( nt -- xt | 0 )
  • find-name ( sd.name -- nt | 0 )
  • find-name-in ( sd.name wid -- nt | 0 )
  • find ( c-addr -- xt n | c-addr 0 )
  • search-wordlist ( sd.name wid -- xt n | 0 )
  • source-id ( -- fileid | -1 | 0 ) — not a fail, but also an example when zero was chosen instead of a special object.

Also, it is a common approach in practice. This allows common higher-order functions to operate on the common failure result 0.

Why should not recognizers follow this practice? Why should they return a special id on failure rather than zero?

Lexical code size

Returning notfound on failure makes the code shorter (in terms of lexemes) in some places. But the point is that it makes code longer in more places.

I checked the source code of Gforth (as of 2023-09-17), which includes both the implementation and usage of a Recognizer API. In its code:

  • ['] notfound with = or <> is used 10 times, and without checking — 32 times.
  • forth-recognize execute is used 3 times.

If we use 0 (zero) instead of the notfound xt, then:

  • ['] notfound <> is removed 5 times, which eliminates 15 lexemes;
  • ['] notfound = is replaced with 0= 5 times, which eliminates 10 lexemes;
  • ['] notfound is replaced with 0 32 times, which eliminates 32 lexemes;
  • the definition for notfound is removed, a definition for ?found is added: : ?found ( x.some\0 -- x.some | 0 -- never ) dup 0= -13 and throw ;, which adds not more than +3 lexemes;
  • forth-recognize execute is replaced with forth-recognize ?found execute 3 times, which adds +3 lexemes;
  • the word ?found can be also used after find, search-wordlist, find-name, find-name-in — when the user needs to execute their result at once, and unsuccess should produce an exception.
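As an illustration of the last point, here is a small sketch of ?found applied to the result of find-name and name>interpret. The word execute-name is a hypothetical helper that is not part of any proposal, and find-name is assumed to be available (it is provided by recent Gforth versions).

```forth
\ As given in the list above: pass a non-zero value through, throw -13 on zero.
: ?found ( x -- x )  dup 0= -13 and throw ;

\ Hypothetical helper: perform the interpretation semantics of the named word,
\ or throw if the name is not found or has no interpretation semantics.
: execute-name ( i*x c-addr u -- j*x )
  find-name ?found  name>interpret ?found  execute
;
```

With a zero failure result, the same ?found serves both lookups, and no special notfound object has to be compared against.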

Thus, replacing notfound with zero reduces the overall lexical code size in Gforth by more than 51 lexemes, which is more than 0.4KiB in absolute size (as of 2023-09-17).

So why should we prefer an approach that increases the overall lexical size of programs?

AntonErtl

About the proposal text

The "Problem" section does not describe a problem of Forth-2012 that the proposal wants to solve, but considers a problem with some other recognizer proposal. Similarly, the "Solution" section refers to some other recognizer proposal. This makes these sections useless for readers who have not first read up on the other proposal, which is not even linked here. Parts of the "Solution" section might be useful in another section on transitioning from the earlier proposal.

Instead, the "Problem" and "Solution" sections should describe what benefits this proposal adds to the standard, and how. A possible "Discussion" section and its subsections should describe the benefits of the present approach over possible alternative approaches (if that's too detailed, lazy system implementors will complain about the length of the proposal, but some complaints should just be ignored).

"Typical use" should of course be presented.

State-dependence

The proposal in its present form is unacceptable to me because it defines a defining word TRANSLATE: for state-dependent words, and expects recognizers to produce the xt of state-dependent words. This makes the translators hard to use anywhere except in INTERPRET; the proposed-for-standard interface is even hard (actually impossible with standard means) to use in POSTPONE, which is an intended user of translators, as the proposal admits itself:

POSTPONE can do that without a standardized way

Another problem with the state-dependent translators is that it leads to either handwaving specifications of what they do, as evidenced in XY.3.1:

TRANSLATE-THING ( jx ix -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.

in the non-specification of what translator-xt does in FORTH-RECOGNIZE and the handwaving specification of "name:" in TRANSLATE:

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

and the nonspecification of what TRANSLATE-NT, TRANSLATE-NUM, TRANSLATE-DNUM, TRANSLATE-FLOAT, and TRANSLATE-STRING do.

Or if you specify exactly what happens, it leads to lengthy texts that explain the state-dependence, and the three different cases. And you cannot even specify when xt-post is performed, because there is no "postpone state" in the standard. On the contrary the current document specifies that STATE is either 0 (interpretation state) or non-zero (compilation state), without any values left for a postpone state, and specifies only words for getting into interpretation state and compilation state, not postpone state.

If you really believe that the state-dependent approach is a good idea, please specify all these words exactly; the editor won't do it for you.

Opaque solution

If there is no need to make POSTPONE implementable in a standardized way, there is no need to make INTERPRET (which is not even standardized) implementable in a standardized way, either, and the translators can become a completely opaque thing that the standard does not document. In that case there is also no need for the translators to actually be executable. The recognizer could return an opaque translation token, and standard programs can only use that for implementing recognizers, but not for implementing text interpreters, POSTPONE, or anything else.

Transparent solutions

Alternatively, we might heed "Don't bury your tools!" and have a more useful interface for translators, like what we have seen in earlier drafts and other recognizer proposals.

POSTPONE

If the idea of the proposal is that xt-post is actually used by POSTPONE, the proposal should specify the change to POSTPONE.

Standardize recognizers

I expect that more people will want to compose existing recognizers into recognizer sequences than to define new recognizers, but they usually need to know about existing recognizers in order to do that. Therefore the proposal (or an accompanying proposal) should not just propose standard translators, but also standard recognizers.

ruv

@AntonErtl writes:

The proposal in its present form is unacceptable to me because it defines a defining word TRANSLATE: for state-dependent words, and expects recognizers to produce the xt of state-dependent words.

I do not like TRANSLATE: either, but for a different reason. Sometimes it is very convenient to define a translator as a quotation (right inside the recognizer), and if you are forced to define a translator only with TRANSLATE:, you cannot define it as a quotation.

This makes the translators hard to use anywhere except in INTERPRET;

Could you provide some examples, please? It seems, this is not harder than performing the observable interpretation semantics using the result of name>interpret.

BerndPaysan

Concerning explicit access methods to xt-int/xt-comp/xt-post, I can offer the following compromise, as a result of observations made:

It turns out that you can not access xt-int and xt-comp by setting STATE, executing the translator, and then reverting STATE to the value before, because words can change STATE as part of their interpretation or compilation semantics, and in that case, the state change is a desired result of performing interpretation or compilation semantics.

However, it turns out that you can access xt-post that way, because the only word that possibly changes that state is [[, and that token is a) not visible at all to POSTPONE, and b) changes the state back to compilation state, the state POSTPONE was in anyhow.

So if your system allows full explicit access to all three possible states, all translators have to be defined by 'TRANSLATE:', and I can offer you three access methods. If you only want to implement POSTPONE, the following definition actually works:

: postpone ( "string" -- )
  parse-name forth-recognize ?found
  state @ >r -2 state ! execute r> state ! ; immediate

Further observations:

Gforth has >INTERPRET and >COMPILE, and doesn't use them, only >POSTPONE is used. In exactly one place, in POSTPONE. All other invocations are through EXECUTE or only taking the data. The rest is implementation, including the extension towards more of those access methods for more, user-defined states. The question is whether you need to standardize a tool that has no use case, even if you don't bury it.

A possible way to deal with this is to move this out to a separate proposal.

What has been quite useful is the EXECUTE interface for user-written interpreters, because these are interpret-only, and don't need the complication of state-dependent translators at all.

BerndPaysan

Sleeping over it added a few ideas:

The invocation through changing STATE and restoring it works (in general) for translators that will definitely not change STATE as part of their own operation, e.g. translators for literals. It also works (as a special case) for POSTPONE, so a standard implementation of POSTPONE using that method is possible. The postpone mode itself, which needs to change STATE at [[ relies on the dispatch through STATE without setting and restoring STATE around the invocation, so it also works.

The question here is not if that implementation is a quality implementation, but whether it's not so bad that it is another bag full of inconsistencies. IMHO, TRANSLATE-NT will have demonstrable inconsistencies when not using the clean TRANSLATE: interface, but combined literal translators won't. For the cleaner interface outside of POSTPONE itself (which is special case enough to not require the cleaner interface), we have to demonstrate that there is an actual use case. So far, we don't have one.

Both POSTPONE with the additional functionality and the postpone mode ]][[ will become part of the proposal.

ruv

@BerndPaysan writes:

It turns out that you can not access xt-int and xt-comp by setting STATE, executing the translator, and then reverting STATE to the value before, because words can change STATE as part of their interpretation or compilation semantics, and in that case, the state change is a desired result of performing interpretation or compilation semantics.

This is wrong. Yes, the state change can be a desired result of interpretation or compilation semantics, but this does not prevent us from performing the interpretation or compilation semantics regardless of the initial value of STATE, as I have shown many times.

We can use the following helpers for that.

\ Useful factors
: compilation ( -- flag )  state @ 0<> ;
: enter-compilation ( comp: false -- true  |  comp: true -- true )  ] ;
: leave-compilation ( comp: true -- false  |  comp: false -- false )  postpone [ ;

\ For the execution semantics identified by xt,
\ perform the part that can be observed in interpretation state.
: execute-interpreting ( i*x xt -- j*x )
  compilation 0= if execute exit then
  leave-compilation execute enter-compilation
;

\ For the execution semantics identified by xt,
\ perform the part that can be observed in compilation state.
: execute-compiling ( i*x xt -- j*x )
  compilation if execute exit then
  enter-compilation execute leave-compilation
;

If we have a result of recognizing with the xt of a translator at the top (i.e., a fully qualified token), and we want to perform the corresponding interpretation semantics regardless of the current value of STATE, we should execute this xt with execute-interpreting. If we want to perform the corresponding compilation semantics regardless of the current value of STATE, we should execute it with execute-compiling. If we want to perform the semantics according to STATE, we should just execute this xt with execute.

The key point in the implementation of execute-interpreting and execute-compiling is that we do not save/restore STATE if it matches the semantics we want to perform, so if changing STATE is part of the semantics, STATE will be changed. On the other hand, if STATE does not match the semantics we want to perform, we change STATE and then restore it. If changing STATE is part of the semantics, then it will change STATE to the very value that was saved and that we restore. Thus, the resulting STATE will be as expected!

NB: execute-interpreting and execute-compiling are also required if we want to perform the interpretation semantics or compilation semantics from an nt, regardless of the current value of STATE. Moreover, these words are required even in the old approach to the Recognizer API, which provides the words RECTYPE>INT and RECTYPE>COMP, because these words have the same flaw for state-dependent words as NAME>INTERPRET and NAME>COMPILE.

ruv

@AntonErtl writes about token translators:

if you specify exactly what happens, it leads to lengthy texts that explain the state-dependence, and the three different cases. And you cannot even specify when xt-post is performed, because there is no "postpone state" in the standard. On the contrary the current document specifies that STATE is either 0 (interpretation state) or non-zero (compilation state), without any values left for a postpone state, and specifies only words for getting into interpretation state and compilation state, not postpone state.

This is reasonable. And we also discussed in the Recognizer chat group that the standard does not imply such a state as postponing (for the Forth text interpreter).

In my opinion, these problems can be avoided.

  1. We should specify "to translate a token" and "token translator" in the common sections of term definitions, data types and usage requirements. Then, we do not need to repeat that for every token translator. It will be enough to specify that a word is a token translator, and the data type of the token (that it translates).

  2. We can have a word like postpone-token ( qt -- ) that appends the compilation semantics of a lexeme, which was recognized as qt, to the current definition. (qt is a qualified token, which is a pair of an unqualified token and a token translator ( uq tt ))

So, any additional state, if any, is encapsulated into postpone-token. The standard should not specify it.

Thus, postpone can be defined like this (in my parlance):

: postpone ( "name" -- )
  parse-lexeme perceive ?found postpone-token
;

How postpone-token finds/performs the postponing action from tt — it's an internal problem of implementation. The word postpone-token should throw the exception -32 "invalid name argument" if a postponing action is not associated with tt.

We need to provide a way to associate a postponing action (an xt) with a tt, or to create a new tt from an xt and tt. The postponing action should be optional. The user needs to provide a postponing action only if they want to make postpone applicable to the corresponding lexemes.

For example, we can have an optional word postponable ( tt1 xt.postponing -- tt2 ). Probably, this word shall return the same tt2 for the same input pair ( tt1 xt.postponing ). This word is optional, because it can be implemented along with postpone-token in a standard program, and postpone can be redefined to use them.

ruv

@AntonErtl writes:

In that case there is also no need for the translators to actually be executable. The recognizer could return an opaque translation token,

I researched this approach.

In general, a recognizer returns a qualified token (qt) on success, where a qualified token is a pair of an unqualified token (uq) and a token descriptor (td).

Data type relations:

  • unqualified token: ut => ( S: i*x F: k*r )
  • token descriptor: td => x\0
  • qualified token: qt => ( ut td )

It is always possible to define a word translate-qtoken ( any qt -- any ), which translates a qualified token (i.e., performs the interpretation or compilation semantics for the corresponding recognized lexeme depending on STATE). And as practice shows, it is very useful and in demand.

Additionally, in Forth, it is always technically possible to make the token descriptor also a token translator (that is a subtype of the execution token), without any loss (see an example).

  • token translator: tt => xt ; td = tt

So, instead of using a separate word translate-qtoken, we can use the word execute. And the Forth text interpreter simply executes the token translator (instead of applying translate-qtoken to qt). Note that regardless of whether the token descriptor is a subtype of the execution token, the token descriptor is opaque to the Forth text interpreter. The only difference is whether translate-qtoken or execute is used by the Forth text interpreter.

The big advantage of token translators is that they can be defined inline as quotations, and they can be used to define other token translators. This simplifies programs and reduces the lexical size of programs.

Also, token translators allow us to define dual-semantics words more simply. For example, this is a definition for ['], which has the expected interpretation semantics:

: ['] ( -- xt | ) '  tt-xt ; immediate

See also in my gist the word missing(, which has the expected interpretation and compilation semantics. Without token translators such words are more difficult to implement.

AntonErtl

Concerning the supposed lack of use cases: I have mentioned use cases where the state-based interface is at the very least cumbersome in r1038.

state is a bad idea, as demonstrated by the problems mentioned above. We are stuck with it for the existing system, but we must not put state in new interfaces, much less in new defining words.

As for opaque vs. transparent: Opaque would only be an option if the only use of translators was really in the text interpreter and in postpone. But if we want to support other use cases (and there are other use cases, as discussed above), we should do a transparent user interface. And it must not be state-dependent.

ruv

Use cases

Concerning the supposed lack of use cases: I have mentioned use cases where the state-based interface is at the very least cumbersome in r1038.

Do you mean this example: "you cannot implement postpone or ]]...[[ as standard-compliant code"? Then it is unclear which specification of these words you cannot implement, because:

  1. You provided a portable (standard compliant) implementation for ]] ... [[ (based on postpone). This implementation does not depend on how postpone is implemented.
  2. A standard postpone can be implemented using find or find-name. An advanced postpone can be also implemented in a standard-compliant way.

Could you please clarify?

Probably you mean that the user should be able to create a recognizer and assign it to the perceptor, and then postpone (and ]] ... [[ that uses this postpone) should be applicable to lexemes that this recognizer recognizes. But I do not see any connection to the state-based interface either.


state is a bad idea, as demonstrated by the problems mentioned above.

I implemented postpone in four different approaches (see fep-recognizer/implementation/variant.gamma/postpone/index.fth) in my "gamma" reference implementation for Recognizer API.

This reference implementation is portable and can be loaded in Gforth as

gforth implementation/index.fth

In every approach I defined the interpretation semantics for postpone, so postpone depends on state. In every approach the words compile-postpone-qtoken ( qt -- ) and translate-postpone-qtoken ( any qt -- any ) are provided. The former does not depend on state, the latter does depend on state.

In the variant postpone/auto.via-mmode.fth the macro-compilation mode is employed (one more state, if you like). This is the variant that is loaded by default in the current version (Commit f3b7d01). The macro-compilation mode is very useful because it also makes it possible to implement a more useful and advanced variant than your construct ]] ... [[.

Could you please demonstrate a problem concerning state-dependency in any of these approaches?


As for opaque vs. transparent: Opaque would only be an option if the only use of translators was really in the text interpreter and in postpone. But if we want to support other use cases (and there are other use cases, as discussed above), we should do a transparent user interface.

Could you please provide a practical example when you need a transparent token descriptor structure?

BerndPaysan

Using EXECUTE instead of a special translator-specific word allows the rest of the recognizer API to be used for interpreters that don't have any state at all. This actually happens and is useful; e.g. the parser in net2o's chat system uses that. There's absolutely no need for any other mode than directly interpreting. And using EXECUTE does not mean you have to set STATE if you call a translator for a particular state (interpreting/compiling/postponing) directly. There are likely to be confusing results, though, if you do so and the word executed is a state-smart word. The level of surprise is likely small, because so far, the only direct access method actually useful is the one for the postpone action. And that never executes the word found.

I don't want to mandate a particular implementation. Choose the implementation you like. I'll add an API that allows direct and default invocation of a translator. I'm not sure if I want this in the same proposal or split it into another one, so we can vote on those separately.

ruv

@AntonErtl wrote in r1038

when you write a state-independent text interpreter, such as a polyForth-style text interpreter, or colorforth-bw, you would have to set state before executing the translators, which is perverse.

It is not more perverse than repeating «_», «]», or «[» before each lexeme in a program ;)

In general, when the Recognizer word set is provided, the Forth text interpreter itself knows nothing about STATE. If you write a state-independent text interpreter, your recognizers should provide state-independent token translators. And you do not have to set state at all. I rewrote your colorforth-bw example using the Recognizer API. It just works. Note how translators are embedded into recognizers using quotations (in Commit 6c72064).

And in the case of colorforth-bw, again there is no standard way to set state to get the translator to perform xt-post.

In this approach, why do you need to write «[postpone _foo» instead of «]foo» ?

AntonErtl

@BerndPaysan:

If you eliminate the state-dependence of translators, then text interpreters that use more than just the xt-int action (e.g., the one for colorforth-bw, see below) can be written without having to deal with state. And text interpreters that use xt-post can be written using the proposed wordset rather than having to use a detour through postpone (which is a parsing word, possibly introducing additional complications).

The following is also relevant to @ruv:

Ruv's colorforth-bw implementation demonstrates the shortcomings of the present proposal, because it does not use recognizers nor translators at all for implementing recognize-colorforth-bw; instead, it reimplements everything that the name recognizer and the number recognizer already do internally, nicely demonstrating that the present proposal buries the tools. And it only implements dealing with names and single-cell numbers. Finally, the implementation is so long (44 lines without putting it into forth-recognize) that you have not shown it inline, but posted a link to github.

By contrast, let's take much of the proposal from [r1081], but replace the state-dependent translators with the state-independent rectypes of [160]. With such a proposal, colorforth-bw might look as follows (untested):

defer recognizer1 forth-recognizer is recognizer1

: prefix>index ( c -- n )
  case
    '[' of  0 endof
    '_' of -1 endof
    ']' of -2 endof
    1 swap
  endcase ;
  
: rectype-colorforth-bw ( ... rectype index state -- ... )
  drop \ we use index, not the surrounding Forth interpreter's state
  swap execute ;

: recognize-colorforth-bw ( c-addr u -- )
  dup 0= if 2drop ['] notfound exit then
  over c@ prefix>index dup 0 > if 2drop drop ['] notfound exit then
  >r 1 /string recognizer1 r> ['] rectype-colorforth-bw ;

' recognize-colorforth-bw set-forth-recognize

This has only 20 lines (vs. 44), and it uses all the recognizers originally present in forth-recognizer (name, integers (including doubles), FP, etc.). This demonstrates the superior expressive power of the rectypes from [160] over the translators from [r1081].

BTW, I find the presence of both forth-recognize and forth-recognizer confusing, and would prefer to define forth-recognize as deferred word. If you have to have getters and setters, call the getter get-forth-recognize.

In this approach, why do you need to write «[postpone _foo» instead of «]foo» ?

Nobody is suggesting that. But you need to perform xt-post in order to implement ]foo. In your implementation, you do it by reimplementing xt-post for the two recognizers you implement internally to recognize-colorforth-bw. If you used a detour through postpone instead, you would use the xt-post invoked in that way. And in my implementation above, xt-post is invoked directly.

ruv

@AntonErtl writes:

Ruv's colorforth-bw implementation demonstrates the shortcomings of the present proposal, because it does not use recognizers nor translators at all for implementing recognize-colorforth-bw; instead, it reimplements everything that the name recognizer and the number recognizer already do internally,

It's wrong. Have a look at L18-L19:

  \ Reuse a recognizer for numbers
  ['] recognize-number-n-prefixed apply-recognizer-cf dup 0= if exit then

It uses the recognizer for numbers. And it uses find-name instead of the recognizer for names (Forth words) just because it's simpler in this case. It does not reuse token translators.

And it only implements dealing with names and single-cell numbers.

Because your original example implemented only that. And I just rewrote your original example.

Finally, the implementation is so long (44 lines without putting it into forth-recognize) that you have not shown it inline, but posted a link to github.

Why count 10 lines of comments at the beginning of the file? Without comments, 31 lines, the same as in your example (lexical size is greater due to nt vs xt, and improvements in the behavior).

By contrast, let's take much of the proposal from [r1081], but replace the state-dependent translators with the state-independent rectypes of [160]. With such a proposal, colorforth-bw might look as follows (untested):

[...]

This has only 20 lines (vs. 44), and it uses all the recognizers originally present in forth-recognizer (name, integers (including doubles), FP, etc.). This demonstrates the superior expressive power of the rectypes from [160] over the translators from [r1081].

(I corrected the [r1081] link in the citation above)

This comparison is incorrect. Below is an implementation against the latest API version (except compile-postpone-qtoken that is a variation of discussed postpone-qtoken, which should be either present or implementable in any variant of API):

: cf-prefix>tt? ( c -- tt true | c false )
  case
    '[' of ['] execute-interpreting endof
    '_' of ['] execute-compiling endof
    ']' of ['] compile-postpone-qtoken endof
    0 exit
  endcase true
;

defer recognize-default  perceptor is recognize-default

: recognize-colorforth-bw ( sd.lexeme -- qt|0 )
  dup 0= if nip exit then
  over c@ cf-prefix>tt? 0= if drop 2drop 0 exit then
  >r 1 /string recognize-default dup if r> exit then rdrop
;

16 lines.

Can be tested in Gforth too:

gforth index.fth example/recognize-colorforth-bw.fth

:noname cf( _1. _drop _s" foo" ) ; execute s" foo" compare 0=  .s \ prints "1 -1"

AntonErtl

The latest proposal is [r1081] and it does not contain execute-interpreting, execute-compiling, compile-postpone-qtoken, or perceptor. And that's what we were tasked with discussing and giving feedback on. And that's what I did.

ruv

The latest proposal is [r1081] and it does not contain execute-interpreting, execute-compiling, compile-postpone-qtoken, or perceptor. And that's what we were tasked with discussing and giving feedback on. And that's what I did.

I see, thank you. Actually, [r1081] is outdated, a new version will be prepared soon and then it should be discussed (was noted in the recognizer chat). Nevertheless, my example implementation for recognize-colorforth-bw above is compatible with [r1081] with the following exceptions: it relies on 0 instead of NOTFOUND (you should note how it makes things simpler), and it uses the method compile-postpone-qtoken that appends the compilation semantics of a qualified token to the current definition (this method is missing in [r1081]). The word perceptor is simply a better name than forth-recognizer in [r1081] (I just posted in ForthHub/fep-recognizer a rationale from the chat).

The words execute-interpreting and execute-compiling are general words that are needed anyway to perform interpretation or compilation semantics regardless of the initial STATE. They can be implemented in standard Forth as:

: compilation ( comp: true ; S: -- true ; | comp: false ; S: -- false ; )  state @ 0<> ;
: enter-compilation ( comp: false -- true ; S: -- ; | comp: true  ; S: -- ; )  ] ;
: leave-compilation ( comp: true -- false ; S: -- ; | comp: false ; S: -- ; )  postpone [ ;
: execute-interpreting ( i*x xt -- j*x )
  compilation 0= if execute exit then
  leave-compilation execute enter-compilation
;
: execute-compiling ( i*x xt -- j*x )
  compilation if execute exit then
  enter-compilation execute leave-compilation
;

ruv

@AntonErtl writes:

If you eliminate the state-dependence of translators, then text interpreters that use more than just the xt-int action (e.g., the one for colorforth-bw, see below) can be written without having to deal with state.

Token translators cannot be written without having to deal with state (possibly indirectly), by the definition of the term. A token translator shall perform different actions depending on the state, and it does not matter how the state is passed to the translator: through the data stack, through a separate stack intended for this purpose, or through an internal variable. The state does not matter in only one case: if the translator shall perform the same action regardless of the state.

Moreover, if you pass a parameter that encodes compilation state or interpretation state by some means other than STATE, you have to keep STATE in sync with this parameter to guarantee that STATE-dependent words are translated correctly.

BerndPaysan

New Version: minimalistic core API for recognizers

Minimalistic Recognizer API

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
  • 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
  • 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
  • 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
  • 2023-09-15 Add list of example recognizers and their names.
  • 2024-12-15 Take comments after freezing the proposal into account

Problem:

The Forth compiler can be extended easily. The Forth interpreter however has a fixed set of capabilities as outlined in section 3.4 of the standard text: Words from the dictionary and some number formats.

It's not possible to use the Forth text interpreter in an application or system extension context. Most interpreters in existing systems use a number of hooks to extend the interpreter. That makes it possible to use a loadable library to implement new data types to be handled like the built-in ones. One example is the floating point numbers: they have their own parsing and data handling words, including a stack of their own.

Furthermore applications need to use system provided and system specific words or have to re-invent the wheel to get numbers with a sign or hex numbers with the $ prefix. The building blocks (FIND, COMPILE,, >NUMBER etc) are available but there is a gap between them and what the Forth interpreter already does.

The Forth interpreter is stateful, but the API should avoid the problems of the STATE variable. In particular, an implementation without STATE should be possible, and there is only one place where the stateful dispatch is necessary.

Solution

The monolithic design of the Forth interpreter is factored into three major blocks:

  1. The interpreter. It extracts sub-strings (lexemes) from SOURCE, hands them over to the data parsing and processes the results.
  2. The actual data parsing. It checks whether lexemes match the criteria for a certain token type. These words, called recognizers, can be grouped to achieve an order of invocation.
  3. The result of the recognizer, a translator and associated data, is handed over to the interpreter.

There is no strict 1:1 relation between a recognizer and the returned translator. A translator for e.g. single cell numbers can be used by different recognizers, a recognizer can return different translators (e.g. single and double cell numbers).

Whenever the Forth text interpreter is mentioned, the standard words EVALUATE (CORE), ' (tick, CORE), INCLUDE-FILE (FILE), INCLUDED (FILE), LOAD (BLOCK) and THRU (BLOCK) are expected to act likewise. This proposal is not about changing these words, but about providing the tools to do so. As long as the standard feature set is used, a complete replacement with recognizers is possible.

Important changes to the Matthias Trute proposal:

  • Make the translators executable to dispatch according to the state (interpreting, compiling, postponing) themselves
  • Use dedicated invocation methods to call a translator for a particular state
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating any particular implementation

The core principle is that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: recognize-xt ( addr u -- translator-stub | 0 )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] noop
  ELSE  nip  THEN ;

then you should factor the part starting with STATE @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: recognize-xt ( addr u -- ... translator | 0 )
  here place  here find dup IF  [']  translate-xt  ELSE  nip  THEN ;

In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you don't know what to do on postpone at this stage, use -48 throw; otherwise define a postpone action:

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt
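
The same pattern carries over to other token types. As a rough sketch, a translator for single-cell numbers could be defined as follows. The TRANSLATE: shown here is only a minimal table-based stand-in that distinguishes interpretation and compilation via STATE, and the helper lit, is assumed to be defined as shown; none of this is meant as the reference implementation.

```forth
\ Minimal stand-in for TRANSLATE: (assumption: two-state system using STATE).
\ The table is laid out as xt-post, xt-comp, xt-int.
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> ( i*x a-addr -- j*x )
    state @ IF  cell+  ELSE  2 cells +  THEN  @ execute ;

: lit, ( n -- )  postpone literal ;

:noname ( n -- n ) ;                    \ interpretation: leave n on the stack
' lit,                                  \ compilation: compile n as a literal
:noname ( n -- )  lit, postpone lit, ;  \ postpone: compile code that compiles n
translate: translate-num
```

Executed while interpreting, 7 translate-num leaves 7 on the stack; executed while compiling, it compiles 7 as a literal into the current definition.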

Typical use

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize ?found execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

To invoke a translator for a particular state, e.g. to postpone a single word, use:

: postpone ( "name" -- )
  parse-name forth-recognize ?found postponing ; immediate

To obtain an xt for a name, use something like this:

: ' ( "name" -- xt )
  parse-name forth-recognize ?found
  ['] translate-nt <> #-32 and throw
  name>interpret ;

Proposal:

XY. The optional Recognizer Wordset

XY.1 Introduction

Recognizers have the form

REC-SOMETYPE ( addr len -- i*x j*r translate-xt | 0/NOTFOUND )

A recognizer takes the string addr len of a lexeme and on success returns a translator translate-xt and additional data on the data and floating point stacks.

[IF] NOTFOUND=0

If it fails, it returns 0.

[ELSE] NOTFOUND=xt

If it fails, it returns the xt of NOTFOUND.

For clarity, until this issue is decided, the non-success return value of a recognizer is notated as 0/NOTFOUND. The reference implementation uses the option 0.

[THEN] notfound
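
Under the NOTFOUND=0 option, a recognizer might look like the following sketch. The name rec-tick-char is hypothetical (a recognizer for character literals of the form 'c'), and the STATE-based translate-num is only a stand-in to keep the example self-contained; a conforming translator would be defined with TRANSLATE:.

```forth
\ Stand-in translator (assumption): a conforming one is defined with TRANSLATE:
: translate-num ( n -- n | )  state @ IF  postpone literal  THEN ;

\ Hypothetical recognizer for character literals of the form 'c'.
\ It only inspects the string and returns 0 on failure.
: rec-tick-char ( addr u -- char translate-num | 0 )
  dup 3 = IF
    over c@ [char] ' = IF
      over 2 chars + c@ [char] ' = IF
        drop char+ c@ ['] translate-num EXIT
      THEN
    THEN
  THEN
  2drop 0 ;
```

On success the character code sits below the translator xt; on failure only a single 0 is left, so the caller can test the top of stack without knowing anything about the token type.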

[IF] side-effect

A recognizer shall not have a side effect.

Rationale: Side effects are supposed to all happen inside the translators. This promise makes it possible to try to recognize something and fail if the result is not desired, without having to roll back unknown changes. Examples: The tick and to recognizers pass a substring of the string to be translated to FORTH-RECOGNIZE, and fail if the result is not a name type.

[THEN] side-effect

XY.3 Additional usage requirements

XY.3.1 Translator

translator: named subtype of xt, and executes with the following stack effect:

name ( j*x i*x -- k*x )

A translator xt interprets, compiles or postpones the action of the recognized lexeme according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the recognized lexeme.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | 0/NOTFOUND ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or 0/NOTFOUND if not.

[IF] defer

FORTH-RECOGNIZE is a deferred word. Changing the system recognizer can be done with IS FORTH-RECOGNIZE, obtaining the system recognizer with ACTION-OF FORTH-RECOGNIZE.

Rationale: use existing API to change it; most simple systems have this available, and advanced systems have capabilities to work around limitations.

[ELSE] setter and getter

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale: not sufficiently advanced systems can work around the limitations of IS and ACTION-OF better with this API.

[THEN]

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER

Create a translator word under the name "name". This word is the only standard way to define a general purpose translator.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current state.

Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. You can not simply set STATE, use EXECUTE and afterwards restore STATE to perform interpretation or compilation semantics, because words can change STATE, so you need the words INTERPRETING and COMPILING defined below. This problem does not apply to POSTPONING, so systems that only want to implement direct access to POSTPONE mode can get away without TRANSLATE:.

[IF] NOTFOUND=0

?FOUND ( translator-xt -- translator-xt | 0 -- never ) RECOGNIZER

Check if the recognizer was successful, and if not, perform a -13 THROW or display an appropriate error message if the exception wordset is not present.

[THEN] NOTFOUND=0

XY.6.2 Recognizer Extension Words

[IF] NOTFOUND=0

?NOTFOUND ( translator-xt -- translator-xt | 0 -- addr u notfound-xt )

Check if the recognizer was successful. If not, replace the 0 result with the addr u of the last scanned lexeme, and put the xt of the NOTFOUND translator on top of the stack.

NOTFOUND ( -- never ) RECOGNIZER

Translator for unsuccessful recognizers: perform a -13 THROW.

[THEN] NOTFOUND=0

POSTPONE ( "<spaces>lexeme" -- ) RECOGNIZER

Compilation: recognize lexeme. On success, perform the postpone action of the returned translator, otherwise -13 THROW or display the appropriate error message if the exception wordset is not present.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn on stack and proceeding towards xt1 until successful.
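
One possible way to realize this behaviour is a simple table, sketched below under the assumption of the NOTFOUND=0 option. The layout (a count followed by the xts in the order they are tried) is an illustrative choice and not mandated by the proposal; the string is kept on the return stack while each recognizer runs, because a successful recognizer may leave an unknown number of items below the translator.

```forth
\ Sketch only: table-based RECOGNIZER-SEQUENCE: for the NOTFOUND=0 option.
\ The table holds n followed by xtn .. xt1, i.e. in the order they are tried.
: recognizer-sequence: ( xt1 .. xtn n "name" -- )
  create  dup ,  0 ?DO  ,  LOOP
  does> ( addr u a-addr -- i*x translator-xt | 0 )
    dup cell+ swap @              ( addr u a-xt n )
    0 ?DO                         ( addr u a-xt )
      -rot 2dup 2>r rot           \ save a copy of the string
      dup >r @ execute            \ save the table position, try recognizer
      dup IF                      \ success: discard saved items and leave
        r> drop  2r> 2drop  UNLOOP EXIT
      THEN
      drop  r> cell+  2r> rot     ( addr u a-xt' )
    LOOP
    drop 2drop 0 ;
```

With this sketch, ' rec-num ' rec-nt 2 recognizer-sequence: my-recognizer (my-recognizer being a hypothetical name) would try rec-nt first and rec-num second.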

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence from xt-seq as xt1 .. xtn n.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token:

Interpretation: perform the interpretation semantics of the word

Compilation: perform the compilation semantics of the word

Postpone: append the compilation semantics above to the current definition

REC-NT ( addr u -- nt translate-nt | 0/NOTFOUND ) RECOGNIZER EXT

Search the dictionary for the string addr u. If successful, return the nt and the xt of TRANSLATE-NT. If the search fails, return 0/NOTFOUND.

TRANSLATE-NUM ( x -- x | ) RECOGNIZER EXT

Translates a number:

Interpretation: keep the number on the stack

Compilation: Append the run-time defined in LITERAL to the current definition

Postpone: Append the compilation semantics above to the current definition

TRANSLATE-DNUM ( x1 x2 -- x1 x2 | ) RECOGNIZER EXT

Translates a double number:

Interpretation: keep the numbers on the stack

Compilation: Append the run-time defined in 2LITERAL to the current definition

Postpone: Append the compilation semantics above to the current definition

REC-NUM ( addr u -- x translate-num | xd translate-dnum | 0/NOTFOUND ) RECOGNIZER EXT

Convert addr u to a number x and the xt of TRANSLATE-NUM as specified in 3.4.1.3 or a double number xd and the xt of TRANSLATE-DNUM as specified in 8.3.1 if the double number wordset is available. If the conversion fails, return 0/NOTFOUND.

TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT

Translates a floating point number:

Interpretation: Keep r on the stack

Compilation: Append the run-time defined in FLITERAL to the current definition

Postpone: Append the compilation semantics above to the current definition

REC-FLOAT ( addr u -- r translate-float | 0/NOTFOUND ) RECOGNIZER EXT

Convert addr u to a floating point number r as specified in 12.3.7 and return it with the xt of TRANSLATE-FLOAT, if the floating-point word set is available; if the conversion fails, return 0/NOTFOUND.

SCAN-TRANSLATE-STRING ( addr1 u1 string-rest<"> -- addr2 u2 | ) RECOGNIZER EXT

Complete parsing a string: addr1 u1 consists of the starting quote and additional characters up to the first space in the string. addr2 u2 consists of the entire string without the starting quote up to (but not including) the final quote, with the escape sequences translated according to the rules of S\". >IN is modified appropriately, and points just after the final quote. If there's no final quote in the current line, REFILL can be used to read in more lines, adding corresponding newlines into the string. The final quote can be inside addr1 u1, setting >IN backwards in that case.

Translate the string:

Interpretation: keep the string on the stack

Compilation: Append the run-time defined in SLITERAL to the current definition

Postpone: Append the compilation semantics stated above to the current definition

TRANSLATE-STRING ( addr1 u1 -- addr1 u1 | ) RECOGNIZER EXT

Translate the string:

Interpretation: keep the string on the stack

Compilation: Append the run-time defined in SLITERAL to the current definition

Postpone: Append the compilation semantics stated above to the current definition

?SCAN-STRING ( addr1 u1 scan-translate-string string-rest<"> -- addr2 u2 translate-string | ... translator -- ... translator ) RECOGNIZER

If the recognized token is an incomplete string, complete the scanning as defined for SCAN-TRANSLATE-STRING and replace the translator with the xt of TRANSLATE-STRING.

REC-STRING ( addr u -- addr u scan-translate-string | 0/NOTFOUND ) RECOGNIZER EXT

Check if addr u starts with a quote, and return that string and the xt of SCAN-TRANSLATE-STRING if it does, 0/NOTFOUND otherwise.

[IF] Optional API for direct access of translator states

INTERPRETING ( j*x xt -- k*x ) RECOGNIZER EXT

Execute xt-int of the translator xt. If xt is not a translator, do -21 THROW, or a best-effort attempt to execute xt in interpreting state.

COMPILING ( j*x xt -- ) RECOGNIZER EXT

Execute xt-comp of the translator xt. If xt is not a translator, do -21 THROW, or a best-effort attempt to execute xt in compiling state.

POSTPONING ( j*x xt -- ) RECOGNIZER EXT

Execute xt-post of the translator xt. If xt is not a translator, do -21 THROW, or a best-effort attempt to execute xt in postponing state.

GET-STATE ( -- xt ) RECOGNIZER EXT

Obtain the operation xt performed when translating.

SET-STATE ( xt -- ) RECOGNIZER EXT

Makes xt the operation performed when translating. If xt is not related to ' INTERPRETING, ' COMPILING, or ' POSTPONING, do -12 THROW.

[THEN] optional API for direct access of translator states
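
For a table-based translator, the direct access words can be very small. The following sketch assumes that translators are created by the two-state stand-in for TRANSLATE: shown here, which lays out the table as xt-post, xt-comp, xt-int; it performs no check that xt really is a translator, so the -21 THROW case is not covered. The translator translate-demo is purely hypothetical.

```forth
\ Assumption: translators are tables laid out as xt-post, xt-comp, xt-int.
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> ( i*x a-addr -- j*x )
    state @ IF  cell+  ELSE  2 cells +  THEN  @ execute ;

\ Direct access: pick the table entry regardless of STATE.
: interpreting ( i*x xt -- j*x )  >body 2 cells + @ execute ;
: compiling    ( i*x xt -- j*x )  >body cell+ @ execute ;
: postponing   ( i*x xt -- j*x )  >body @ execute ;

\ Hypothetical demo translator whose actions just report which one ran
:noname ( -- 1 )  1 ;
:noname ( -- 2 )  2 ;
:noname ( -- 3 )  3 ;
translate: translate-demo
```

Here ' translate-demo postponing runs the third action without touching STATE, while plain translate-demo dispatches on STATE as the text interpreter would.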

]] ( -- ) RECOGNIZER EXT

Interpretation semantics: undefined

Compilation semantics: Set the system into postpone state. The interpreter will then perform xt-post of all translators found. Compilation state resumes when [[ is recognized. This word may change STATE and the recognizer sequence to reflect the change of this state.

[[ ( -- ) RECOGNIZER EXT

Interpretation semantics: undefined

Compilation semantics: undefined

Postpone semantics: enter compilation state, see ]; all changes to STATE and recognizer sequence done by ]] are reverted.

Note: [[ needs special treatment in postpone mode, so it might also use a non-standard translator and not be a word at all.

STATE ( -- addr ) RECOGNIZER

If ]] uses STATE to store postpone state, extends the semantics of 6.1.2250 by adding a second non-zero value. ]] enters this state, and [[ leaves it. Only translators and the code responsible for displaying the prompt can see this third state, as all other words are postponed in this state.
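
As a minimal sketch of how prompt code might tell the three states apart: the helper names STATE-NAME and .STATE are hypothetical and not part of this proposal, and the values 0, -1 and -2 are an assumption taken from the reference implementation, which stores -1 and -2 into STATE for compile and postpone state.

```forth
\ Hypothetical helper: map a STATE value to a printable name.
\ Assumes 0 = interpreting, -1 = compiling, -2 = postponing.
: state-name ( n -- addr u )
  dup  0 = if  drop s" interpreting"  exit then
  dup -1 = if  drop s" compiling"     exit then
      -2 = if       s" postponing"    exit then
  s" unknown" ;

\ Print the name of the current state, e.g. as part of a prompt.
: .state ( -- )  state @ state-name type ;
```

A system that keeps the postpone state somewhere other than STATE would need a different test here.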

Reference implementation:

This is a minimalistic core implementation of a recognizer-enabled system that handles only words and single numbers without base prefix. It takes only interpret and compile state into account, and uses the STATE variable to distinguish them. It uses NOTFOUND=0.

Defer forth-recognize ( addr u -- i*x translator-xt / 0 )
: ?found ( translator -- translator  |  0 -- never )
  dup 0= IF  -13 throw  THEN ;
: interpret ( i*x -- j*x )
  BEGIN
      parse-name dup  WHILE
      forth-recognize ?found execute
  REPEAT  2drop ;
: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

An alternative implementation for TRANSLATE: can use a deferred word:

Defer do-translate
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , , does> do-translate ;
: set-state ( xt -- ) dup is do-translate  >body @ 2 - state ! ;
: get-state ( -- xt ) action-of do-translate ;

Extensions reference implementation:

: ]] -2 state ! ; immediate
: [[ -1 state ! ; immediate
:noname name>interpret execute ;          \ interpret
:noname name>compile execute ;            \ compile
:noname dup name>interpret ['] [[ =       \ postpone
  IF    name>interpret execute \ special case
  ELSE  name>compile swap lit, compile,  THEN ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator | 0 )
  forth-wordlist find-name-in dup IF  ['] translate-nt  THEN ;
: rec-num ( addr u -- n num-translator | 0 )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop 0  THEN ;
: minimal-recognize ( addr u -- nt nt-translator | n num-translator | 0 )
  2>r 2r@ rec-nt dup 0= IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: translate-method: ( n -- )
  Create , DOES> @ cells + >body @ execute ;
0 translate-method: postponing
1 translate-method: compiling
2 translate-method: interpreting
: set-state ( xt -- )
  >body @ 2 - state ! ;
: get-state ( -- xt )
  case state @
      0  of ['] interpreting  endof
      -1 of ['] compiling     endof
      -2 of ['] postponing    endof
  -11 throw
  endcase ;

: postpone ( "name" -- )
  parse-name forth-recognize ?found postponing ; immediate

This reference implementation uses a table dispatch only. Note that this can give surprising results when you directly apply a particular state, and one of the words executed (translator or nt/xt found) is a state-smart word. If you want to use combined translators, like

: translate-dnum ( d -- ) >r translate-num r> translate-num ;

you can't do it like this. Neither does this work if you execute state-smart words, as they expect STATE to be set accordingly. Instead, you'll use something like

: translate-method: ( n -- )
  Create ,
  DOES> @ dup state @ = IF  drop execute  EXIT  THEN
    state @ >r  state !  execute  r> state ! ;

This will definitely work for combined literal translators, because those don't change state anyways.

This will also work for POSTPONE, because apart from the translator, no word is actually executed in one-shot POSTPONE, and therefore, no state change is possible.

This will also work for [ and ] (and words using them) while interpreting and compiling, because if you are already in the state from which the state is changed away, you will not restore the state. If you are in the state this will change to, this will work, too, because the state is restored after EXECUTE. This will not work if you are interpreting, and you do a s" ]]" forth-recognize ?found compiling, because that transitions to postponing, and then is reverted to interpreting.
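
The save-and-restore step discussed above can be sketched as a separate word. WITH-STATE is a hypothetical name, not part of this proposal. The sketch assumes a system that lets programs store into STATE, as the reference implementation does, and it adds CATCH so that the old state also comes back when the executed word throws. Unlike the TRANSLATE-METHOD: variant above, it always restores the old state.

```forth
\ Hypothetical helper, not part of the proposal:
\ run xt with STATE temporarily set to n, then restore the old value.
: with-state ( i*x xt n -- j*x )
  state @ >r  state !        \ remember the old state, switch to n
  catch                      \ run xt, trapping any THROW
  r> state !                 \ put the old state back in every case
  throw ;                    \ re-raise a trapped exception, if any
```

Because the old value is always restored, a word that is meant to change the state permanently (such as ]]) has its change undone, which is the limitation described in the paragraph above.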

[IF] setter and getter

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

[THEN] setter and getter

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;
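
The stack library uses CELL and BOUNDS, which are common (for example in Gforth) but are not Forth-2012 standard words. A minimal shim, assuming the meanings below are the intended ones:

```forth
\ Shims for systems that lack these words; both definitions are
\ assumptions about the intended meaning (they match Gforth).
[undefined] cell   [if]  1 cells constant cell  [then]
[undefined] bounds [if]
  : bounds ( addr u -- addr+u addr )  over + swap ;
[then]
```

With these in place, `4 stack: s  10 20 30 3 s set-stack  s get-stack` leaves 10 20 30 3 on the data stack, so GET-STACK returns the items in the order SET-STACK received them.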

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | 0 )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP 0
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- )
  ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n )
  ?defer@ >body get-stack ;

Once you have recognizer sequences, define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute dup IF  drop  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Recognizer examples

REC-NT ( addr u -- nt translate-nt | notfound ) Search the locals wordlist if locals have been defined, and then the search order for a definition matching the string addr u, and provide that name token as result.

Apart from the standardized recognizers above, here are some more examples of recognizers:

REC-NUM ( addr u -- n translate-num | d translate-dnum | notfound ) Try converting addr u into a number, and on success return either a single number n and translate-num, or a double number d and translate-dnum.

REC-TICK ( addr u -- xt translate-num | 0/NOTFOUND ) If addr u starts with a ` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.

REC-FLOAT ( addr u -- r translate-float | notfound ) Try converting addr u into a floating point number, and on success return that number r and translate-float.

REC-SCOPE ( addr u -- nt translate-nt | 0/NOTFOUND ) Search for words in a specified vocabulary. The string addr u has the form vocabulary:name, and the vocabulary itself must be found in the current search order. Apart from specifying the vocabulary to be searched, REC-SCOPE is identical in effect to REC-NT.
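
The first step of such a recognizer is splitting the token at the colon. The following is a sketch of that step only, using standard words. The names COLON-POS and SPLIT-SCOPE are hypothetical, and looking up the vocabulary and the name is left out.

```forth
\ Hypothetical helpers, not part of the proposal.
: colon-pos ( addr u -- n | -1 )   \ index of the first ':' or -1
  0 ?do
    dup i chars + c@ [char] : = if  drop i unloop exit  then
  loop
  drop -1 ;

: split-scope ( addr u -- addr1 u1 addr2 u2 true | addr u false )
  2dup colon-pos                        \ addr u n
  dup 1 < if  drop false exit  then     \ no colon, or empty vocabulary part
  2dup 1+ = if  drop false exit  then   \ colon is the last char: empty name
  >r over r@                            \ addr u addr1 u1
  2swap r> 1+ /string                   \ addr1 u1 addr2 u2
  true ;
```

For the token vocab:name, addr1 u1 describes vocab and addr2 u2 describes name. A token without a usable colon is returned unchanged with a false flag, so the caller can report 0/NOTFOUND.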

REC-STRING ( addr u "string"<"> -- addrs us translate-string | notfound "string"<"> ) Convert quoted strings (i.e. addr u starts with '"') in the input stream into string literals, performing the same escape handling as S\" and on success return the converted string as addrs us and translate-string.

REC-TO ( addr u -- xt n translate-to | 0/NOTFOUND ) Handle the following syntax of TO-like operations of value-like words:

  • ->name as TO name
  • =>name as IS name
  • +>name as +TO name
  • '>name as ADDR name
  • @>name as ACTION-OF name

xt is the execution token of the value found, n indexes which variant of a TO-like operation is meant, and translate-to is the corresponding translator.
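
Classifying the two-character prefix can be sketched as follows. TO-PREFIX is a hypothetical name, and the index values 0 to 4 are an assumption made for illustration, since the text above does not fix a numbering. Finding the xt of name is left out.

```forth
\ Hypothetical helper, not part of the proposal.
\ On success, strip the prefix and return an assumed variant index:
\ 0 = TO, 1 = IS, 2 = +TO, 3 = ADDR, 4 = ACTION-OF.
: to-prefix ( addr u -- addr' u' n true | addr u false )
  dup 3 < if  false exit  then               \ two prefix chars plus a name
  over char+ c@ [char] > <> if  false exit  then
  over c@ case
    [char] - of  0  endof
    [char] = of  1  endof
    [char] + of  2  endof
    [char] ' of  3  endof
    [char] @ of  4  endof
    drop false exit                          \ unknown first character
  endcase
  >r 2 /string r> true ;
```

A full REC-TO would then look up the remaining name and return its xt together with n and translate-to, or 0/NOTFOUND if the name is not a value-like word.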

REC-ENV ( addr u -- addr1 u1 translate-env | 0/NOTFOUND ) Takes a pattern in the form of ${name} and provides the name as addr1 u1 on the stack. The corresponding translator TRANSLATE-ENV is responsible for looking up that name in the operating system's environment variable array, or compiling appropriate code to do so.

REC-COMPLEX ( addr u -- rr ri translate-complex | 0/NOTFOUND ) Converts a pair of floating point numbers in the form of float1+float2i into a complex number on the stack, and returns the xt of TRANSLATE-COMPLEX on success.

Testing

T{ 0 recognizer-sequence: RS -> }T

T{ :noname 1 ; :noname 2 ; :noname 3 ; translate: translate-1 -> }T
T{ :noname 10 ; :noname 20 ; :noname 30 ; translate: translate-2 -> }T

\ really stupid: 1 character length or 2 characters
T{ : rec-1 NIP 1 = IF ['] translate-1 ELSE 0 THEN ; -> }T
T{ : rec-2 NIP 2 = IF ['] translate-2 ELSE 0 THEN ; -> }T

T{ ' translate-1 interpreting -> 1 }T
T{ ' translate-1 compiling -> 2 }T
T{ ' translate-1 postponing -> 3 }T

\ set and get methods
T{ 0 ' RS set-recognizer-sequence -> }T
T{ ' RS get-recognizer-sequence -> 0 }T

T{ ' rec-1 1 ' RS set-recognizer-sequence -> }T
T{ ' RS get-recognizer-sequence -> ' rec-1 1 }T

T{ ' rec-1 ' rec-2 2 ' RS set-recognizer-sequence -> }T
T{ ' RS get-recognizer-sequence -> ' rec-1 ' rec-2 2 }T

\ testing RECOGNIZE
T{ 0 ' RS set-recognizer-sequence -> }T
T{ S" 1" RS -> 0 }T
T{ ' rec-1 1 ' RS set-recognizer-sequence -> }T
T{ S" 1" RS -> ' translate-1 }T
T{ S" 10" RS -> 0 }T
T{ ' rec-2 ' rec-1 2 ' RS set-recognizer-sequence -> }T
T{ S" 10" RS -> ' translate-2 }T


Reply New Version