Proposal: minimalistic core API for recognizers

Considered

This page is dedicated to discussing this specific proposal

ContributeContributions

BerndPaysanavatar of BerndPaysan [160] minimalistic core API for recognizersProposal2020-09-06 09:40:07

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned data type id is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- rectype )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] rectype-null  THEN ;

then be told that this is not the right way, even though it looks like it is working.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes a string and returns a rectype+additional data on the stack (no additional data for RECTYPE-NULL):

REC-SOMETYPE ( addr len -- i*x RECTYPE-SOMETYPE | RECTYPE-NULL )

XY.3 Additional usage requirements

XY.3.1 Data type id

rectype: subtype of xt, and executes with the following stack effect:

RECTYPE-SOMETYPE ( i*x state -- j*x )

state is:

  • 0 for interpretation
  • -1 for compilation
  • -2 for POSTPONE

i?x is the additional information provided by the recognizer.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( addr len -- i*x rectype | RECTYPE-NULL ) RECOGNIZER

This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.

RECTYPE-NULL ( state -- ) RECOGNIZER

Performs -13 THROW if the exception wordset is available.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x rectype / rectype-null )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer state @ swap execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: rectype-null ( state -- ) -13 throw ;
: rectype-nt ( nt state -- )
  case
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip // do nothing if state is unknown; possible error handling goes here
  endcase ;
: rectype-num ( n state -- )
  case
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt rectype-nt / rectype-null )
  forth-wordlist find-name-in dup IF  ['] rectype-nt  ELSE  drop ['] rectype-null  THEN ;
: rec-num ( addr u -- n rectype-num / rectype-null )
  0. 2swap >number 0= IF  2drop ['] rectype-num  ELSE  2drop drop ['] rectype-null  THEN ;

: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
  2>r
  2r@ rec-nt dup ['] rectype-null <> IF  EXIT  THEN  drop
  2r@ rec-num dup ['] rectype-null <> IF  EXIT  THEN  drop
  2r> 2drop ['] rectype-null ;

' minimal-recognizer is forth-recognizer

Testing

JennyBrienavatar of JennyBrien

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

I don't think so. It doesn't make much difference in application, because you (almost always?) need to consume the rec-type immediately to use whatever else might be on the stack(s). It you already know what you've got, but, for example, can't remember the words to POSTPONE it you could with an active RECTYPE do something like:

    -2 RECTYPE-X

But mostly you'll have the RECTYPE sitting passively on the stack as a return for a recognizer, and I don't see a great deal of difference between:

    : postponed  -2 swap execute ; 

and

    : postponed  @ execute ;

Passive rectypes are easier to use (no need to remember to when to tick them) and easier to code (no need to check for a bogus mode on the stack)

Compare:

: rectype-nt ( nt state -- )
  case
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip // do nothing if state is unknown; possible error handling goes here
  endcase ;

with:

 : rectype: create , , , ;
 :noname name>interpret execute ;
 ;noname name>compile execute ;
 ;noname name>compile swap lit, compile, ;  rectype: rectype-nt

BerndPaysanavatar of BerndPaysan

One possible thing is to have an automatic postpone for literals.

: rectype-lit: ( compile-xt "name" -- )
  create ,
  does> @ swap
  case
      0  of  drop  endof
      -1 of  execute  endof
      -2 of  dup >r execute r> compile,  endof
  endcase ;

' lit, rectype-lit: rectype-num
' 2lit, rectype-lit: rectype-dnum
' flit, rectype-lit: rectype-float
' slit, rectype-lit: rectype-string

This works with this method, but not with the previous way.

BerndPaysanavatar of BerndPaysan

Furthermore, obviously anyone sane who doesn't want to be 100% minimal would instantly define

: rectype: ( xt-int xt-comp xt-post "name" -- )
  create , , , does> swap 2 + cells + @ execute ;

and then define generic rectypes just like in Matthias Trute's version with rectype:

JennyBrienavatar of JennyBrien

: rectype-lit: ( xt -- )  ['] noop swap dup >r :noname r@ compile, r> postpone literal postpone compile, postpone ; rectype: ;

not so straightforward, but possible.

ruvavatar of ruv

Previous works

In general, I like the approach of active "rectype", i.e. when you can execute it to translate a token — so a "rectype" is a token translator: ( i*x token -- j*x ). I described this approach in comp.lang.forth in 2018 (news:pngvcc$pta$1@gioia.aioe.org).

Bernd should also remember comparison of version D with Resolvers API, where I specified this approach, and even several POCs.

and then define generic rectypes just like in Matthias Trute's version with rectype:

I also shown, just for illustration, a hybrid variant, when "rectype" can be executed and be an argument of the accessors (and it also is compatible with version D, i.e. it is a "passive rectype" as JennyBrien mentioned above).

But the accessors from version D exclude some implementation approaches. Actually these accessors are useless when the higher methods are provided. Getting an xt and then executing this xt has an excessive step without any profit in the most cases. Let's provide the corresponding methods instead of the accessors.

This works with this method, but not with the previous way.

Don't sure what you refer to, but "automatic postpone for literals" can be implemented in version D too.

: create-rectype-for-literal ( xt-compiler "name" -- )
  ['] noop swap dup rectype:
;

Token translator

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves

RECTYPE-SOMETYPE ( i*x state -- j*x )

By convention, the name for such a word should start from an English verb.

Concerning passing the state. In my Resolvers API, the state is passed indirectly, i.e. not via the stack. It makes more easy the combinations of translators.

E.g.:

: tt-3lit ( 3*x -- 3*x | ) >r tt-2lit  r> tt-lit ;

VS

: tt-3lit-s ( 3*x state -- 3*x | ) dup >r swap >r tt-2lit-s  r> r>  tt-lit-s ;

Passing the state is cumbersome. Also, take into account that it's usually already kept in a variable in any way. Why do you need to pass it via the stack again and again? What is a rationale for passing it directly?

Terminology

Please stop using the confusing terminology such as "data type id" (in "The core principle is still that the recognizer is not aware of state, and the returned data type id is"). This terminology is not compatible with the language of the standard. I suggested the proper terminology before and have published on forth-standard.org now the proposal, let's use it (and let's make it better, if any), or let's accurately define another terminology. The fact is that all the proposals about recognizers can share the same terminology.

Another example is "recognizer types" term. If a recognizer is a Forth definition having particular behavior, then "recognizer type" is "type of a recognizer", that is a type of a Forth definition, something like function type. But actually you mean a "token descriptor", that is "descriptor of a token", that tells something about the corresponding token, and tells nothing about the recognizers (as Forth definitions).

ruvavatar of ruv

Advantages

A huge advantage of this approach (but when the state is passed indirectly) is that the most user-defined token translators can be created far easily than the corresponding descriptors ("rectypes"). You don't need to cope with three actions, and you don't need to cope with the state at all, since any token translator can be created via other already defined translators!

BerndPaysanavatar of BerndPaysan

Yes, I proposed that kind of solution years ago. In effect, both ways have the same expressive power, but one does it by creation of noname words, the other by normal code. Acceptance may differ.

ruvavatar of ruv

@JennyBrien wrote

Compare: [...] with:

 : rectype: create , , , ;
 :noname name>interpret execute ;
 :noname name>compile execute ;
 :noname name>compile swap lit, compile, ;  rectype: rectype-nt

(sic: the full postpone action).

This comparison is incorrect since in the proposed API rectype: (that generates a token translator) can be defined as the following:

: rectype: ( xt-executer xt-compiler xt-postponer "name" -- )
  >r >r >r : ]]
    0  of  [[ r> xt, ]] endof
    -1 of  [[ r> xt, ]] endof
    -2 of  [[ r> xt, ]] endof
    -22 throw
  endcase [[ postpone ;
;

And you can use the same your code to define your rectype-nt or anything else.

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned data type id is. If you have for some reason legacy code that looks like

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- rectype )
: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] rectype-null  THEN ;

then be told that this is not the right way, even though it looks like it is working.

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- rectype )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes a string and returns a rectype+additional data on the stack (no additional data for RECTYPE-NULL):

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x RECTYPE-SOMETYPE | RECTYPE-NULL )
REC-SOMETYPE ( addr len -- i*x translator | NOTFOUND )

XY.3 Additional usage requirements

XY.3 Additional usage requirements

XY.3.1 Data type id

XY.3.1 Translator

rectype: subtype of xt, and executes with the following stack effect:

translator: subtype of xt, and executes with the following stack effect:

RECTYPE-SOMETYPE ( i*x state -- j*x )
SOME-TRANSLATOR ( i*x -- j*x )

state is:

A translator depends on STATE to translate the given arguments:

  • 0 for interpretation
  • -1 for compilation
  • -2 for POSTPONE

i?x is the additional information provided by the recognizer.

i*x is the additional information provided by the recognizer.

XY.6 Glossary

XY.6 Glossary

XY.6.1 Recognizer Words

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( addr len -- i*x rectype | RECTYPE-NULL ) RECOGNIZER

FORTH-RECOGNIZER ( addr len -- i*x translator | NOTFOUND-xt ) RECOGNIZER

This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.

RECTYPE-NULL ( state -- ) RECOGNIZER

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW if the exception wordset is available.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x rectype / rectype-null )
Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer state @ swap execute
      forth-recognizer execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: rectype-null ( state -- ) -13 throw ;
: rectype-nt ( nt state -- )
  case
: notfound ( state -- ) -13 throw ;
: nt-translator ( nt -- )
  case  state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip // do nothing if state is unknown; possible error handling goes here
  endcase ;
: rectype-num ( n state -- )
  case
: num-translator ( n -- )
  case  state @
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;
: rec-nt ( addr u -- nt rectype-nt / rectype-null )
  forth-wordlist find-name-in dup IF  ['] rectype-nt  ELSE  drop ['] rectype-null  THEN ;
: rec-num ( addr u -- n rectype-num / rectype-null )
  0. 2swap >number 0= IF  2drop ['] rectype-num  ELSE  2drop drop ['] rectype-null  THEN ;
: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] nt-translator  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] num-translator  ELSE  2drop drop ['] notfound  THEN ;
: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
  2>r
  2r@ rec-nt dup ['] rectype-null <> IF  EXIT  THEN  drop
  2r@ rec-num dup ['] rectype-null <> IF  EXIT  THEN  drop
  2r> 2drop ['] rectype-null ;
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;
' minimal-recognizer is forth-recognizer

The different actions during interpret/compile/postpone can be factored out easily, and used by a common dispatcher:

: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

Testing

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] rectype-null  THEN ;
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- rectype )
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translator | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

SOME-TRANSLATOR ( i*x -- j*x )

A translator depends on STATE to translate the given arguments:

  • 0 for interpretation
  • -1 for compilation
  • -2 for POSTPONE

i*x is the additional information provided by the recognizer.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( addr len -- i*x translator | NOTFOUND-xt ) RECOGNIZER

This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.

This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW if the exception wordset is available.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: nt-translator ( nt -- )
  case  state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip // do nothing if state is unknown; possible error handling goes here
  endcase ;
: num-translator ( n -- )
  case  state @
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] nt-translator  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] num-translator  ELSE  2drop drop ['] notfound  THEN ;
: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

The different actions during interpret/compile/postpone can be factored out easily, and used by a common dispatcher:

: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

Testing

BerndPaysanavatar of BerndPaysan

Downside of using STATE right in the dispatcher: POSTPONE becomes more difficult. Instead of

: postpone ( "name" -- ) parse-name forth-recognizer -2 swap execute ; immediate

it is more convoluted

: postpone ( "name" -- )
  parse-name forth-recognizer
  state @ >r -2 state !  catch  r> state !  throw ; immediate

How to detect [[ at the end of a postpone sequence is also not so trivial.

ruvavatar of ruv

Downside of using STATE right in the dispatcher: POSTPONE becomes more difficult.

It's OK. Actually, we distribute complexity among various parts. When we make one thing less complex, we make another thing more complex. But due to the different numbers of occurrences of various things (in systems, libraries, programs) the summary complexity can be less or more.

This approach also makes some things more complex, but the summary complexity decreases, I believe.

Concerning POSTPONE. I think, some useful parts should be factored out.

Also, we don't need to catch exception — usually, it's a stop error, and the state is ambiguous in any case. QUIT resets all the internal states. Concerning programs — we need a standard way to reset the internal states of the Forth text interpreter, regardless of Recognizers proposal.

In my "lexeme resolvers" implementation I use conception of postponing level that can be 0, 1, 2, and introduce the words to increment and to decrement this level. So, POSTPONE is defined as the following:

: postpone  ( " name" --      )   parse-name inc-state translate-lexeme dec-state ( flag ) ?nf ; immediate

Where translate-lexeme is defined as the following:

: perceive-lexeme ( c-addr u -- k*x xt-tt | c-addr u 0 )
  perceptor dup if execute then
;
: translate-lexeme ( i*x c-addr u -- j*x true | c-addr u 0 )
  perceive-lexeme dup if execute true then
;

(Note that in contrast of this proposal, resolvers return ( c-addr u 0 ) on fail)

How to detect [[ at the end of a postpone sequence is also not so trivial.

An appropriate approach is that the word ]] is a parsing word.

: ]] ( -- )
  inc-state begin
    next-lexeme 2dup s" [[" equals 0= while
    translate-lexeme ?nf
  repeat 2drop dec-state
; immediate

So we don't have any problem to detect [[ at the end.

An advantage of the postponing level conception is that the following code works as expected:

: foo [  ]] 123 . [[  ]  ;   foo \ prints 123

In the message news:rdcur5$ga4$1@dont-email.me (the full message: news:rdcn35$sd2$1@dont-email.me) I showed another approach, when postponing action is not required at all (i.e., -2 state in this proposal).

ruvavatar of ruv

translator: subtype of xt, and executes with the following stack effect:

SOME-TRANSLATOR ( i*x -- j*x )

It's correct in the general case, but it makes a little sense, since any definition meets this stack effect.

So I think we should distinguish the parameters of a translator itself from the effect of translating of the code that is passed to the translator. Possible variants:

\ We can define 'token' data type
TRANSLATE-SOMETOKEN ( i*x token -- j*x )

\ Some hybrid variant
TRANSLATE-SOMETOKEN  ( i*x token{k*x} -- j*x )

\ Only low level data types
TRANSLATE-SOMETOKEN  ( i*x k*x -- j*x ) 

(NB: I use a conventional naming {verb}-{noun} for such a words).

It should be also noted that these x may be distributed in all the stacks: the data stack, the floating-pint stack, the control-flow stack (except token k*x, that cannot be in the contrlo-fow stack).

BerndPaysanavatar of BerndPaysan

Indeed, TRANSLATE-SOMETHING sounds better than SOMETHING-TRANSLATOR.

FORTH-RECOGNIZER is ok, because it's followed by EXECUTE, so this is a noun.

ruvavatar of ruv

"FORTH-RECOGNIZER" name

I thought about FORTH-RECOGNIZER name. It makes a strong impression that this word is similar to FORTH-WORDLIST ( -- wid ). The problem is that it isn't.

FORTH-WORDLIST is a constant (it always return the same value), that indicates a one the same word list among all the word lists. This word list can be included into the search order, and it can be absent in the search order.

By analogy, FORTH-RECOGNIZER should be a constant that indicates a one the same recognizer among all the recognizers. This recognizer can be included into the recognizer that is used by the Forth text interpreter, and it can be absent in the recognizer that is used by the Forth text interpreter. (In accordance with the conception that a sequence of recognizers is also a recognizer).

All these should be right to hold consistent naming. But actually it is wrong. It means, that this name breaks consistency and isn't inappropriate for the proposed word.

FORTH-RECOGNIZER ( -- xt ) can be a word that returns xt of the system's recognizer that is used by the Forth text interpreter by default (i.e. initially).

FORTH-RECOGNIZER is ok, because it's followed by EXECUTE, so this is a noun.

Also, it makes a strong impression that it returns a recognizer. But it's wrong. Also, it's result is analyzed much more often than it's followed by EXECUTE.

Basic methods

By no means, we need

  1. a method that tells the Forth text interpreter to use a given recognizer.
  2. a method that returns the recognizer that is currently used by the Forth text interpreter,
  3. a method that performs the recognizer that is currently used by the Forth text interpreter

A one differed word (a vector) X can solve it:

  1. set: IS X
  2. get: ACTION-OF X
  3. perform: X

But I insist that this approach limits implementations too much. A Forth system can want to perform its internal actions on switching the recognizer that is used by the Forth text interpreter. But it cannot do it, if this recognizer is switched via IS X method. For that, the different getter and setter words are usually provided in the Standard (except very ancient BASE and >IN — due to back compatibility). Yes, perhaps Gforth can attach any additional internal actions for IS X phrase. But we shouldn't complicate all Forth system implementations.

A possible implementation via deferred word and distinct getter and setter words:

defer perceive ( c-addr u -- k*x tt )
: perceptor ( -- xt ) action-of perceive ;
: set-perceptor ( xt -- ) is perceive ;

Perhaps, the more specific names are better (?):

defer perceive-lexeme ( c-addr u -- k*x tt )
: lexeme-perceptor ( -- xt ) action-of perceive-lexeme ;
: set-lexeme-perceptor ( xt -- ) is perceive-lexeme ;

ruvavatar of ruv

Correction: pleas read "By anyway, we need" instead of "By no means, we need".

BerndPaysanavatar of BerndPaysan

´DEFERis a core word now, so usingDEFER` for such a thing is ok. We don't need a special getter and setter for everything.

The implication that FORTH-RECOGNIZER returns a recognizer (and does not, it executes one) is a valid point. A better name is needed. At the moment it is a VALUE and does return a recognizer. Now, it is a deferred word, and does recognize strings. We should keep it with Anton's unification: a sequence of recognizers can be combined to one recognizer. Just because it's now recognizing more different things, it's still a recognizer. No need to find another synonym. Takes string, returns data+translator token ? is a recognizer.

Maybe RECOGNIZE-FORTH is the corresponding verb. It takes a string and recognizes it if this is valid FORTH.

ruvavatar of ruv

DEFER is a core word now, so using DEFER for such a thing is ok.

Actually, DEFER, as well as TO, is a Core extension word, so it's optional. But it's another argument.

Back to my first argument, what do you suggest if a system needs to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter?

You can ask, do I have an example of such requirement. Yes, I do. I want to provide a method to undo such switching in my system. It's similar to effect of the "PREVIOUS" word for the search order. Perhaps you can suggest some solution with the deferred word?

Anton's unification: a sequence of recognizers can be combined to one recognizer.

Yes. I too said that any sequence of recognizers seq-x (from API v4) can be represented as a single recognizer : recognize-x seq-x recognize ;. So, sequences are excessive in the basic API, — a Forth system doesn't need to know is it a sequence or not.

Maybe RECOGNIZE-FORTH is the corresponding verb. It takes a string and recognizes it if this is valid FORTH.

It's better. But it recognizes not valid FORTH, but anything what the Forth text interpreter can currently recognize (and only that).

Conceptually, this word isn't just a recognizer. There is a single special system's slot for a recognizer that is used by the Forth text interpreter. We can put any recognizer into this slot. We can also perform the recognizer that is placed into this slot. So this word performs the recognizer from this slot. I incline to call this slot "perceptor". And after that the word that performs the recognizer from this slot becomes "perceive".

All recognizer names have the pattern RECOGNIZE-*. The idea is to not put this special word on a par with all other recognizers. For that, its better to find a name that is distinct from the RECOGNIZE-SOMETHING pattern. What do you think?

ruvavatar of ruv

Actually, DEFER, as well as TO, is a Core extension word, so it's optional. But it's another argument.

This argument is that a Forth system can be implemented as a minimal kernel and additional libraries. And DEFER, IS, ACTION-OF can be available via a library. But when we put a deferred word into this API, we force a system's author to put DEFER, IS, ACTION-OF into the kernel too. But actually they isn't required in the kernel. It would be too restrictive limitation on the implementations.

ruvavatar of ruv

Locate

locate cannot work for lexemes that can be recognized (translated) according to this proposal.

ruvavatar of ruv

The last comment was intend for the proposal of AndrewHaley, and it was mistakenly placed here.

BerndPaysanavatar of BerndPaysan

The recognizer will be an option, as well. At the moment, FORTH-RECOGNIZER is proposed to be a value. That's also a CORE EXT word (as is TO).

A minimalistic system that wants to implement recognizers needs FORTH-RECOGNIZER to be a deferred word. I.e. it needs code for DODEFER. It can load the rest of the deferred word stuff later as extension.

ruvavatar of ruv

Certainly, recognizers is an option. I didn't mean that some required part requires an optional part. I mean that one optional part requires another complex optional part without any good and fair ground.

Yes, a minimalistic system that wants to provide a deferred word needs only code for DODEFER. But it still makes bootstrapping of this system more complex. Hence, when we put a deferred word into API, we make things more complex for some implementations. But we don't even have a rationale for that.

Also, with deferred word we still don't have a solution if a system needs to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter.

BerndPaysanavatar of BerndPaysan

CORE has only VARIABLE as option for storing things to change. As a result, the interface to use FORTH-RECOGNIZER has to be clumsy, i.e.

forth-recognizer @ execute execute

Clumsy interfaces can not be changed if you have better things at hand. You can probably wrap around the clumsy interface, e.g.

Defer recognize-forth
addr recognize-forth Constant forth-recognizer

if you can use ADDR to access the deferred word's xt storage location. But then you have another interface, less clumsy, and only available when you have DEFER+ADDR (and ADDR is not even part of the standard).

A minimalistic API, as what I am looking for here is one where you don't have to document much. The less uniform an API is, the more you have to document. The uniformity here is that a recognizer is a word that has ( addr u -- i*x translator-xt ) as stack effect. And combinations of recognizers have the same effect. And the system's recognizer is just another one, which you can swap in and out. And you can define a REC-SEQUENCE, where you can manipulate the sequence, and put that into the system's recognizer.

This uniformity is broken when you don't use a deferred word for the system's recognizer — you can't just call that one as you can call the others. You need @ EXECUTE. This is clumsy.

ruvavatar of ruv

CORE has only VARIABLE as option for storing things to change. As a result, the interface to use FORTH-RECOGNIZER has to be clumsy, i.e. forth-recognizer @ execute execute

I don't suggest to use a variable in the interface, — it's even worse than a defer. When a variable is used to change something, this changing cannot be effectively detected. But the requirement is: an ability for a system to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter.

For that I would prefer to have the separate words in the API: a setter, a getter and a "performer" (a word that performs the recognizer that is currently used by the Forth text interpreter).

What are your objections to have several separate words in the minimalistic API?

The uniformity here is that a recognizer is a word that has ( addr u -- i*x translator-xt ) as stack effect.

I strongly support this approach (and I myself suggested this approach too, with slightly different stack effects).

This uniformity is broken when you don't use a deferred word for the system's recognizer

It seems, the set of words like the following (the names may vary):

perceive ( c-addr u -- k*x tt )
set-perceptor ( xt -- )
perceptor ( -- xt )

doesn't brake the mentioned uniformity. Please, clarify.

BerndPaysanavatar of BerndPaysan

Using special setters and getters means you have another (special purpose) DEFER mechanism here. Of course you can implement that with

variable current-perceptor
: perceive ( addr u -- i*j token ) current-perceptor @ execute ;
: set-perceptor ( xt -- ) current-perceptor ! ;
: perceptor ( -- xt ) current-perceptor @ ;

which is probably a bit less implementation effort than DEFER, IS, and ACTION-OF. Or really?

State-Smart:

: defer  Create ['] noop ,  does> @ execute ;
: is  ' >body state @ if  ]] literal ! [[  else  !  then ; immediate
: action-of  ' >body state @ if  ]] literal @ [[  else  @  then ; immediate

or with NDCS:

: defer  Create ['] noop ,  does> @ execute ;
: is  ' >body ! ; ndcs: ' >body ]] literal ! [[ ;
: action-of  ' >body @ ;  ndcs: ' >body ]] literal @ [[ ;

DEFER is really a lightweight way to define words that can be changed.

These three lines of code are doing more than the three lines of code you need in addition when you have your special-purpose setter and getter, but they are still one-liners.

Forthers like to reinvent the wheel. But don't overdo this.

ruvavatar of ruv

Using special setters and getters means you have another (special purpose) DEFER mechanism here.

Not necessary. It's up to an author/implementer. It can be just wrappers over standard DEFER, as I shown earlier. So it doesn't mean reinventing the wheel. The implementation details are just hidden.

So the arguments concerning implementation of DEFER mechanism say nothing against three separate words in the minimalistic API.

BTW, having translators for the basic data types, the words is and action-of can be even shorter:

: is  ' >body tt-lit ['] ! tt-xt ; immediate
: action-of  ' >body tt-lit ['] @ tt-xt ; immediate

Well, in any case I would agree that the arguments concerning complexity are more or less weak.

A strong argument (that wasn't yet commented) is about additional actions that a system needs to perform in the setter. What do you thing in this regard?

ruvavatar of ruv

One more strong argument against DEFER word in the API, and pro the different getter and setter is following.

Having DEFER in the API, we cannot define this API over another API at all. But having the different getter and setter (and "executer") — it's possible to defined this API over some other APIs.

Example: news:rn1csa$b02$1@dont-email.me

BerndPaysanavatar of BerndPaysan

Gforth's new header structure allows to overload TO, IS (which are essentially the same) and DEFER@, so we can use the DEFER API to access similar changeable execution patterns implemented differently. So for us, it makes sense to use these access words, regardless how it is implemented.

Other systems may not have this capability, though the way the standard now extends TO for FVALUE and others, you need to have one way or the other to deal with that. Same, when you have an UDEFER in your system for user-specific deferred words.

For me, it is needless clutter of the dictionary and the mental space of the programmer to add setters and getters for things where you already have a generic one. But I see the point that not every system can do this.

ruvavatar of ruv

needless clutter of the dictionary and the mental space of the programmer

I used an approach when a defined word creates two words — a getter and a setter. It's something like after the phrase create-prop x the words x and set-x are created. I didn't noticed any mental space clutter in this regard. Sometimes I redefined set-x to add additional checks or actions.

Concerning dictionary space — I don't see any problem.

But I see the point that not every system can do this.

True. And even if a system can do this, it's done in some system specific way only.

So, due to the combination of all reasons, it's better to have distinct ordinary words in the standard API.

StephenPelcavatar of StephenPelc

If people are interested, I can arrange a virtual meeting for recognisers. They have been workshopped at various Forth Standards meetings but little of substance has emerged so far. I would suggest that such a meeting concentrate on finding what we can agree on.

Note that Forth-200x meetings are public, and the use of real names is strongly encouraged.

ruvavatar of ruv

If people are interested, I can arrange a virtual meeting for recognisers. ... concentrate on finding what we can agree on.

I like this idea.

If people are interested, I will prepare before the meeting a proof of concept — an implementation of Recognizer API v4, Nestable Recognizer Sequences, or some other over this API.

Perhaps, somebody could share his list of questions before the meeting. My list at GitHub.

StefanKavatar of StefanK

A small remark to the POSTPONE test.

We can factor postpone in two parts with state-execute similiar to base-execute:

   : state-execute ( xt s -- )  state@ >r state ! catch r> state ! throw ;
   : POSTPONE ( "name" -- ) parse-name forth-recognizer -2 state-execute ; immediate

That's not very difficult anymore.

StefanKavatar of StefanK

IMHO the idea to use a deferred forth-recognize is good and more flexible than a stack of recognizers. But the translator xt makes postpone more difficult. But we can factor postpone into two parts. One that restores the stack contents at runtime similar to lit,, and one that does the compilation. If we use rectype, similar to the proposal of recognizers from 2018, but with lit, as third method, we get an easy postpone and '. Here, we can reuse the compile method directly by the second factor of postpone.

variable state

: translate>interpret @ ;
: translate>compile   cell+ @ ;
: translate>lit,      cell+ cell+ @ ;
\ Well, its translate>*lit, in fact; i.e. regenerate ( i*x ) at runtime.

Defer forth-recognizer ( addr u -- i*x translator / notfound )
Defer perform ( i*x translator -- j*x )

: perform>interpret translate>interpret execute ;
: perform>compile   translate>compile   execute ;

: on  -1 swap ! ;
: off  0 swap ! ;

: [ ['] perform>interpret is perform state off ; IMMEDIATE
: ] ['] perform>compile   is perform state on ;

\ alternativly:
\ :noname is perform state ; dup
\ : [ ['] perform>interpret [ compile, ]  off ; IMMEDIATE
\ : ] ['] perform>compile   [ compile, ]  on ;
\ another alternative:
\ : perform state @ IF translator>compile ELSE translator>interpret THEN execute ;

' [ execute \ initialize state and perform
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer perform
  REPEAT ;

: lit,  ( n -- )  lit lit , ,  ; \ or postpone literal
: throw-13 -13 throw ;

: translator ( xt-*lit, xt-compile  xt-interpret "name" -- )
    create , , , ;

' throw-13 dup dup translator notfound

' lit,
:noname ( nt -- xt-execute | xt-compile, ) dup >cfa swap immediate? IF execute ELSE compile, THEN ;
:noname ( i*x nt -- j*x ) >cfa execute ;
translator translate-nt

' lit,
' lit,
:name ; \ noop
translator translate-const-cell

: rec-nt ( addr u -- nt translate-nt | notfound )
  forth-wordlist find-name-in dup IF  translate-nt  ELSE  drop notfound  THEN ;
: rec-num ( addr u -- n translate-const-cell | notfound )
  0. 2swap >number 0= IF  2drop translate-const-cell  ELSE  2drop drop notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator | n num-translator | notfound )
  2>r 2r@ rec-nt dup notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

\ simple postpone
: postpone ( "name" -- )
    parse'n'recognize
    dup translator>compile >r translator>lit, execute r> compile,
; IMMEDIATE

\ postpone optimized for immedate words
: postpone ( "name" -- ) \ optimized for immediate words
    parse'n'recognize \ ( i*x translator )
    dup translate-nt = IF ( nt translator )
        over immediate? IF drop >cfa compile, exit THEN
    THEN
    dup translator>compile >r translator>lit, execute r> compile, ;
; IMMEDIATE

: ' ( "name" -- xt ) parse'n'recognize translate-nt <> IF throw-13 THEN >cfa ; IMMEDIATE

ruvavatar of ruv

@StefanK, thank you for your participation. But it looks like you have missed too many arguments discussed above.

For example, a deferred word forth-recognizer has a confusing name, and it cannot be acceptable in the API, since it's difficult for the Forth system to detect when its value is changed (NB: it isn't an argument in favor of "stack of recognizers").


: translator ( xt-*lit, xt-compile  xt-interpret "name" -- )
    create , , , ;
' lit,
:noname ( nt -- xt-execute | xt-compile, ) dup >cfa swap immediate? IF execute ELSE compile, THEN ;
:noname ( i*x nt -- j*x ) >cfa execute ;
translator translate-nt

Also, could you please stick to a consistent and clear terminology?

In this example you create not a token translator, but a named token descriptor (and the corresponding token descriptor object). See Common terminology for recognizers (improvements and critics are welcome).

   token descriptor object: an implementation dependent data object that describes how to interpret, how to compile and how to postpone (if any) a token .

I also proposed the following naming convention for the corresponding words:

  • For token translators use names in the form tt-* — that is the abbreviation of translate-token-*; for example, tt-lit, tt-nt.
  • For token descriptors use names in the form td-*— that is the abbreviation of token-descriptor-*; (for example, td-lit, td-nt)

The employed approach in your example to create a token descriptor can be called "three components" approach. A significant disadvantage of this approach is that it doesn't provide a way to reuse old descriptors when you create a new descriptor. Compare to token translators — they can be easily reused to create new token translators. For example, a token translator for a pair ( nt nt ) can be created using the token translator tt-nt for a single nt as:

: tt-2nt ( i*x nt nt -- j*x ) >r tt-nt r> tt-nt ;

To create a token descriptor td-2nt in the three components approach, you need to put in a lot more effort, and you cannot reuse td-nt descriptor.

One possible solution is to don't expose the three components approach in the API and instead provide a special method to create a descriptor from another descriptors. For td-2nt it can look as:

  tt-nt dup 2 descriptor constant tt-2nt
  \ or
  td{ tt-nt tt-nt }td constant tt-2nt

It seems, a user never needs to provide three components for a new descriptor since any new descriptor is always based on some already defined descriptors.

But the approach based on the token translators is far simpler.


By the way, a well known word to get xt from nt is name> ( nt -- xt )(see Forth-83 / "C. Experimental proposal" / "Definition field address conversion operators").

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translator | NOTFOUND )
REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

XY.3.1 Recognized

translator: subtype of xt, and executes with the following stack effect:

recognized: subtype of xt, and executes with the following stack effect:

SOME-TRANSLATOR ( i*x -- j*x )
RECOGNIZED-THING ( j*x i*x state -- k*x )

A translator depends on STATE to translate the given arguments:

A recognized xt acts on the state passed to it on the stack

  • 0 for interpretation
  • -1 for compilation
  • -2 for POSTPONE

i*x is the additional information provided by the recognizer.

i*x is the additional information provided by the recognizer, jx and kx are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( addr len -- i*x translator | NOTFOUND-xt ) RECOGNIZER

FORTH-RECOGNIZE ( addr len -- i*x recognized-xt | NOTFOUND-xt ) RECOGNIZER

This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or NOTFOUND if not.

Takes a string and tries to recognize it, returning the recognized xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW if the exception wordset is available.

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

GET-FORTH-RECOGNIZE ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE.

REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

RECOGNIZED: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a recognized word under the name "name", which performs xt-int for state=0, xt-comp for state=-1 and xt-post for state=-2.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: nt-translator ( nt -- )
  case  state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip // do nothing if state is unknown; possible error handling goes here
  endcase ;
: num-translator ( n -- )
  case  state @
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] nt-translator  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] num-translator  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

The different actions during interpret/compile/postpone can be factored out easily, and used by a common dispatcher:

Extensions reference implementation:

: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

Testing

Stacks TBD.

Testing

TBD

ruvavatar of ruv

A recognized xt acts on the state passed to it on the stack

  1. A proper term for "recognized xt" ("recognized execution token") should be chosen. "recognized xt" means "xt that is recognized", but we don't recognize execution tokens, but recognize lexemes. This xt just is a result of recognizing a lexeme. And it should be named according what it does, not according who produces it.

  2. There is no reason to pass state on the stack — we discussed that, and the reference implementation reflect that.

BerndPaysanavatar of BerndPaysan

The STATE discussion in the 2021 workshop concluded that words or xt executed should not depend on STATE. The reference implementation needs to be adjusted.

For the name of the result values we might want to have another round of bikeshedding. In particular with more native speakers. The current wording represents the last round of bikeshedding.

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Recognized

recognized: subtype of xt, and executes with the following stack effect:

RECOGNIZED-THING ( j*x i*x state -- k*x )

A recognized xt acts on the state passed to it on the stack

  • 0 for interpretation
  • -1 for compilation
  • -2 for POSTPONE

i*x is the additional information provided by the recognizer, jx and kx are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x recognized-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the recognized xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

GET-FORTH-RECOGNIZE ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE.

REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

RECOGNIZED: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a recognized word under the name "name", which performs xt-int for state=0, xt-comp for state=-1 and xt-post for state=-2.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: nt-translator ( nt -- )
  case  state @
: recognized-nt ( nt state -- )
  case
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip // do nothing if state is unknown; possible error handling goes here
  endcase ;
: num-translator ( n -- )
  case  state @
: recognized-num ( n state -- )
  case
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] nt-translator  ELSE  drop ['] notfound  THEN ;
  forth-wordlist find-name-in dup IF  ['] recognized-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] num-translator  ELSE  2drop drop ['] notfound  THEN ;
  0. 2swap >number 0= IF  2drop ['] recognized-num  ELSE  2drop drop ['] notfound  THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

Stacks TBD.

Testing

TBD

ruvavatar of ruv

The STATE discussion in the 2021 workshop concluded that words or xt executed should not depend on STATE.

I see the following in the report by @ulli on 2021-09-08:

Given the two variations to handle STATE (either in RECOGNIZER:'s DOES> part or in INTERPRET), yesterdays participants favoured to have the single occurrence of STATE in INTERPRET. Further investigation and model implementations will show whether on or the other is beneficial.

So it implies further investigation and model implementations.

Could someone provide a rationale in favor to pass state (better say "mode") via the stack?


My rationale against mode on the stack is following:

  1. It makes combination of token translators cumbersome. E.g. a definition : tt-3lit ( 3*x -- 3*x | ) >r tt-2lit r> tt-lit ; becomes far more complex.
  2. In most cases a program doesn't need to execute a token translator in a mode that is different from the current mode (counter examples are welcome, except postpone).
  3. The current mode is already held by the system anyway.
  4. (most importantly) It introduces unnecessary coupling between the Forth text interpreter loop and the Recognizer API. This loop does not need to know anything about modes and STATE at all. If we are replacing the system's lexeme translator (along with the system's set of token translators), we should be able to replace it along with the system's STATE (and the set of the system's modes) too. Moreover, a token translator can technically ignore the passed value and use it's own set of modes. And even such a simpler mode-beyond-stack API can be implemented over one that passes mode via the stack.

On the other hand I don't think that including (mentioning) STATE in a new API is a good choice. STATE returns a read-only address, and it's provided for back compatibility only. So a better method instead of STATE is required anyway.

Actually, the system's token translators are the only ones who depend on the system's set of modes. In most cases user-defined token translators are defined via system's token translators (which should be standardized) and they need to know nothing about system's set of modes, and about STATE at all. In the same time, a user is able to define own set of recognizers and set of token translators that don't depend on system's set of modes, but introduce own set of modes.

So, the specification for Recognizer API should not mention nether STATE nor a set of magic values like {0, -1, -2}.


Concerning your mode -2 — I believe, the standard word postpone doesn't need an own mode. But in postponing mode, if any, string literals (like s" foo bar") and comments should be properly treated.

ruvavatar of ruv

  1. It introduces unnecessary coupling between the Forth text interpreter loop and the Recognizer API.

See a block-based illustration of this idea in my Gist.

ruvavatar of ruv

: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

In the reference implementation you keep the mode {0,-1,-2} in the state variable. But it's problematic, regardless how the mode is passed into token translators (directly via the stack, or indirectly using a dedicated method).

Since, when interpretation state is set by [ (so state is set to 0), and then compilation state is set by ], the mode should be the same as before [. If it was -2, it should be set to -2. But information that the mode was -2 is lost. So another variable should be used to keep a flag whether "POSTPONE" mode is active or not.

Actually, the mode of compilation/interpretation and "POSTPONE" mode are not mutual-exclusive. They can be set independently of each other.

For example, the code:

: foo postpone bar [ postpone baz ] ;

conceptually can be pretty clear defined (see my comment). In this fragment, for the lexeme bar "POSTPONE" mode is active in compilation state, for baz "POSTPONE" mode is active in interpretation state.

So, if "POSTPONE" mode is employed, a different variable for it should be used for this reason too.


On the other hand I'm not convinced that we need "POSTPONE" mode at all. Except to implement the word postpone itself, where and how this mode can be used? Even for the questionable construct ]] ... [[ the mode "POSTPONE" isn't needed.

BerndPaysanavatar of BerndPaysan

OK, the most convincing argument is that STATE can go away as specified thing. You can use and combine system translators, and you can create table-driven translators, but STATE is an implementation detail.

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  word-translator
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Recognized

XY.3.1 Translator

recognized: subtype of xt, and executes with the following stack effect:

translator: subtype of xt, and executes with the following stack effect:

RECOGNIZED-THING ( j*x i*x state -- k*x )
THING-TRANSLATOR ( j*x i*x -- k*x )

A recognized xt acts on the state passed to it on the stack

A translator xt that performs or compiles the action of the thing according to what the state the system is in.

  • 0 for interpretation
  • -1 for compilation
  • -2 for POSTPONE

i*x is the additional information provided by the recognizer, jx and kx are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x recognized-xt | NOTFOUND-xt ) RECOGNIZER

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the recognized xt and additional information if successful, or NOTFOUND if not.

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

GET-FORTH-RECOGNIZE ( -- xt ) RECOGNIZER EXT

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE.

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused.

REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

RECOGNIZED: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a recognized word under the name "name", which performs xt-int for state=0, xt-comp for state=-1 and xt-post for state=-2.

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

TRANSLATE-INT ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state

TRANSLATE-COMP ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state

TRANSLATE-POST ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

Translates a name token; system component you can use to construct other translators of.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

Translates a number; system component you can use to construct other translators of.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognizer ( addr u -- i*x translator / notfound )
Defer forth-recognizer ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: recognized-nt ( nt state -- )
  case
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip // do nothing if state is unknown; possible error handling goes here
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: recognized-num ( n state -- )
  case
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] recognized-nt  ELSE  drop ['] notfound  THEN ;
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] recognized-num  ELSE  2drop drop ['] notfound  THEN ;
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  2 cells + @ execute ;
: translate-comp ( translate-xt -- )  cell+ @ execute ;
: translate-post ( translate-xt -- )  @ execute ;

Stacks TBD.

Stacks TBD, copy from Trute proposal.

Testing

TBD

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

THING-TRANSLATOR ( j*x i*x -- k*x )

A translator xt that performs or compiles the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, jx and kx are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused.

REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

TRANSLATE-INT ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state

TRANSLATE-COMP ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state

TRANSLATE-POST ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

Translates a name token; system component you can use to construct other translators of.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

Translates a number; system component you can use to construct other translators of.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognizer ( addr u -- i*x translator-xt / notfound )
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  2 cells + @ execute ;
: translate-comp ( translate-xt -- )  cell+ @ execute ;
: translate-post ( translate-xt -- )  @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Stacks TBD, copy from Trute proposal.

Stack library

: STACK: ( size "name" -- )
  CREATE 1+ ( size ) CELLS ALLOT
  0 OVER ! \ empty stack
;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize   ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP NOTFOUND
;
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  dup stack: dup cells negate here + set-stack
  DOES>  recognize ; 
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) >body get-stack ;

Testing

TBD

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

THING-TRANSLATOR ( j*x i*x -- k*x )

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that performs or compiles the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, jx and kx are the stack inputs and outputs of interpreting/compiling or postponing the thing.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused.

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

TRANSLATE-INT ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state

TRANSLATE-COMP ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state

TRANSLATE-POST ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

Translates a name token; system component you can use to construct other translators of.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

Translates a number; system component you can use to construct other translators of.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Defining translators

Once you have TRANSLATE:, and the associated invocation tools, you shall define the translators using it:

: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt

Stack library

: STACK: ( size "name" -- )
  CREATE 1+ ( size ) CELLS ALLOT
  0 OVER ! \ empty stack
;
  CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize   ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP NOTFOUND <> IF
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP NOTFOUND
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  dup stack: dup cells negate here + set-stack
  DOES>  recognize ; 
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) >body get-stack ;
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
' root ' forth ' forth 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that performs or compiles the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

TRANSLATE-INT ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state

TRANSLATE-COMP ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state

TRANSLATE-POST ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

Translates a name token; system component you can use to construct other translators of.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

Translates a number; system component you can use to construct other translators of.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;
' minimal-recognizer is forth-recognizer
' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Defining translators

Once you have TRANSLATE:, and the associated invocation tools, you shall define the translators using it:

: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
' root ' forth ' forth 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that performs or compiles the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

TRANSLATE-INT ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state

TRANSLATE-COMP ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state

TRANSLATE-POST ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

Translates a name token; system component you can use to construct other translators of.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

Translates a number; system component you can use to construct other translators of.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Defining translators

Once you have TRANSLATE:, and the associated invocation tools, you shall define the translators using it:

: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
' root ' forth ' forth 3 recognizer-sequence: search-order
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

AntonErtlavatar of AntonErtl

It seems to me that, given the reference implementation

' translate-nt translate-int
' translate-num translate-int
' translate-dnum translate-int

does not work (nor with translate-comp nor translate-post). Assuming you solve this, do you really want me to define, e.g.,

:noname ['] translate-nt translate-int ;

to get an xt equivalent to one of the xts that has been passed to translate:?

How do you implement POSTPONE (IIRC Matthias Trute has a reference implementation for that)?

What problem is solved by making all the translators state-smart? The problem I see is that you can only access the individual actions by saving state, setting state, executing the translator, and restoring the state. That's not a good design.

The specification of translate: mentions a "current mode". Where do I find out what a "mode" is? This is non-standard terminology.

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that performs or compiles the action of the thing according to what the state the system is in.

A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

INTERPRET-TRANSLATOR ( tanslate-xt -- xt-interpret ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

TRANSLATE-INT ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state

TRANSLATE-COMP ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

COMPILE-TRANSLATOR ( tanslate-xt -- xt-compile ) RECOGNIZER EXT

Translate as in compilation state

TRANSLATE-POST ( jx ix translator-xt -- k*x ) RECOGNIZER EXT

POSTPONE-TRANSLATOR ( tanslate-xt -- xt-postpone ) RECOGNIZER EXT

Translate as in postpone state

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

Translates a name token; system component you can use to construct other translators of.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

Translates a number; system component you can use to construct other translators of.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap literal compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;
: interpret-translator ( tanslate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;

Defining translators

Once you have TRANSLATE:, and the associated invocation tools, you shall define the translators using it:

: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

INTERPRET-TRANSLATOR ( tanslate-xt -- xt-interpret ) RECOGNIZER EXT

Translate as in interpretation state

Get the interpreter xt from the translator

COMPILE-TRANSLATOR ( tanslate-xt -- xt-compile ) RECOGNIZER EXT

Translate as in compilation state

Get the compiler xt from the translator

POSTPONE-TRANSLATOR ( tanslate-xt -- xt-postpone ) RECOGNIZER EXT

Translate as in postpone state

Get the postpone xt from the translator

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

Translates a name token; system component you can use to construct other translators of.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

Translates a number; system component you can use to construct other translators of.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap literal compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: interpret-translator ( tanslate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

INTERPRET-TRANSLATOR ( tanslate-xt -- xt-interpret ) RECOGNIZER EXT

Get the interpreter xt from the translator

COMPILE-TRANSLATOR ( tanslate-xt -- xt-compile ) RECOGNIZER EXT

Get the compiler xt from the translator

POSTPONE-TRANSLATOR ( tanslate-xt -- xt-postpone ) RECOGNIZER EXT

Get the postpone xt from the translator

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

Translates a name token; system component you can use to construct other translators of.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

Translates a number; system component you can use to construct other translators of.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap literal compile, ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: interpret-translator ( tanslate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
  • 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

INTERPRET-TRANSLATOR ( tanslate-xt -- xt-interpret ) RECOGNIZER EXT

Get the interpreter xt from the translator

COMPILE-TRANSLATOR ( tanslate-xt -- xt-compile ) RECOGNIZER EXT

Get the compiler xt from the translator

POSTPONE-TRANSLATOR ( tanslate-xt -- xt-postpone ) RECOGNIZER EXT

Get the postpone xt from the translator

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

Translates a name token; system component you can use to construct other translators of.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

Translates a number; system component you can use to construct other translators of.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: interpret-translator ( tanslate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysanavatar of BerndPaysan

I removed the access words to the xts for a reason: We don't do it that way in Gforth, and we actually found little use of those words.

There are (at least) three ways to create an interface to these operations:

  1. Field-like, i.e. you can read and write the xts for interpretation/compilation/postpone. The typical usage is ( translator ) TRANSLATE-<state> @ EXECUTE.
  2. Valuefield-like, as in the original Trute proposal. You can read the xts for interpretation/compilation/postpone without an extra @, but you can't write them, unless your access word also implements TO. The typical usage is ( translator ) TRANSLATE-<state> EXECUTE.
  3. Deferfield-like, which is what Gforth does. Here, not only the @ is part of the operation, but also the EXECUTE (really a tail-call variant of it). You can neither read nor write the xts, unless the access words also implement IS and ACTION-OF. The typical usage is ( translator ) TRANSLATE-<state>, and that looks about right. Gforth uses different names to not collide with the proposal here.

Gforth offers as an extension to add more states and thus more access words, and that extension also adds IS, TO (which are synonyms) and ACTION-OF to the existing (it is only one, only for postpone state you need it explicit) access word, and also implements the other two for interpret and compile, which are never used on their own. Of course when you add a new state, you need to specify what existing translators do on that state, so IS becomes necessary, and ACTION-OF just comes for free through Gforth's way of implementing TO and variants, of which ACTION-OF is one. This extension is non-standard, and not proposed here, it is used for creating obscured (“tokenized”) source code and reading name=value-style config files.

The experience so far is that outside of this extension, there's only one of those three access words needed at all, which is TRANSLATE-POSTPONE, and it is exclusively needed inside the standard word POSTPONE itself, a word where the implementation is left up to the system anyways. So the usage of these words is extremely limited. Therefore, I deleted them and suggest not to standardize these words, following the “don't speculate” rule and the topic of this proposal to make a minimalistic API, which contains only what's necessary. These words are of little use, and therefore there's no need to standardize them.

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
  • 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
  • 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone in this stage, use -48 throw, otherwise define a postpone action:

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

Translates a name token; system component you can use to construct other translators of.

Translates a name token.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

Translates a number; system component you can use to construct other translators of.

Translates a number.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

AntonErtlavatar of AntonErtl

Making the translators depend on state is a bad idea. It means that everything using the translators becomes infected with this state-dependency. It also means that you cannot implement postpone or ]]...[[ as standard-compliant code (while, with state-independent translators you could).

Moreover, when you write a state-independent text interpreter, such as a polyForth-style text interpreter, or colorforth-bw, you would have to set state before executing the translators, which is perverse. And in the case of colorforth-bw, again there is no standard way to set state to get the translator to perform xt-post.

BerndPaysanavatar of BerndPaysan

The experience with the usage in Gforth (non-standard extensions excluded) shows that direct calls to translators with a specific state are limited to postpone, which is compile-only and therefore

: postpone ( "name" -- )
  -2 state ! parse-name forth-recognize execute -1 state ! ; immediate compile-only

is not generating surprises (postpone is expected to leave the system in compilation state after it has done its work). In Gforth, ]] and [[ are implemented by changing state, and for recognizing the super-immediate [[ a special recognizer is added to the stack which returns a translator that has a specific postpone effect that changes back to compilation state and drops the additional recognizer from the stack.

' noop dup :noname  ] forth-recognizer stack> drop ; translate: translate-[[

The state-dependent invocation is the 99.9% case for translators, and that includes ]] and [[.

The Forth outer interpreter depends on state (or a similar internal representation). The object that deals with the different actions depending on state is the translator.

The proposal allows you to implement other ways to access the individual methods of a translator, if you need them. It does not encourage anymore to use translators as building blocks for other translators, and we can add wording that only translators created by translate: are standard-conforming. Since there's little use for these other access methods, it does not suggest to standardize those.

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
  • 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
  • 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
  • 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone in this stage, use -48 throw, otherwise define a postpone action:

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

Create a translator word under the name "name". This word is the only standard way to define a translator.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE-independently, which only works on translators created by TRANSLATE: (e.g. for implementing POSTPONE), so any other way to define a translator is non-standard.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

Translates a name token.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

Translates a number.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

ruvavatar of ruv

Anton writes:

Making the translators depend on state is a bad idea.

It's a subject of terminology. A translator depends on state by definition:

to translate a token: to interpret the token if interpreting, or to compile the token if compiling.

token translator: a Forth definition that translates a token; also, depending on context, the execution token for this Forth definition.


If you want a recognizer to return not a execution token, but some opaque identifier, I suggest to call it "descriptor".

token descriptor object: an implementation dependent data object (a set of information) that describes how to interpret and how to compile a token.

token descriptor: a value that identifies a token descriptor object; also, less formally and depending on context, a Forth definition that just returns this value (i.e., a constant), or a token descriptor object itself.

BerndPaysanavatar of BerndPaysan

As said, this is just about moving things around. There's little difference if you use translator-execute or execute on a translator as specified way to get from the translator to its state-dependent action. It's all system-dependent and hidden, and systems might implement it without even referring to STATE and only update STATE to reflect compilation and interpretation state and otherwise never look at it, and the way the system internally keeps its state can be completely different. The most obvious difference is that with translator-execute, you need another word.

The fact that some abstract data type is executable does not mean EXECUTE is the only way to operate on it. Recognizer sequences are executable in this proposal, and they still can be read out with and set by GET/SET-RECOGNIZER-SEQUENCE. So you can't just define them as colon definitions, you need to go through RECOGNIZER-SEQUENCE: to define them.

Though I don't propose to standardize this, the proposal also suggests to make word list ids executable, and put them together in a recognizer sequence called search-order. word list ids still have be used in other ways, e.g. to add new words to them, and the details are left to the system; but it is clear that they can't be normal colon definitions.

Not providing an abstraction like either translator-execute or execute, and instead putting it directly as state @ abs cells + @ execute into the outer interpreter is a really bad idea, because all details of the reference implementation in which this sequence works become then part of the standard. Other ways to implement it, which may have performance advantages, or not expose the postpone state in STATE would then not be allowed. The following implementation should be standard, too:

: do-translate ( translate-body -- ) 0 + @ execute ;
: state! ( state -- ) dup state ! abs cells ['] do-translate >body cell+ ! ; \ assume threaded code
: translate: ( int-xt comp-xt post-xt "name" - - )
  create swap rot , , ,
  does> do-translate ;
: [ 0 state! ; immediate
: ] -1 state! ;
: ]] 2 cells ['] do-translate >body cell+ ! ; immediate \ STATE left as is

How to recognize [[ is left as exercise to the reader, hint: a recognizer is a good idea, because it actually provides something that is executed at postpone time.

AntonErtlavatar of AntonErtl

By comparison, with the first version of this proposal postpone can be implemented like this:

: postpone parse-name forth-recognize -2 swap execute ; immediate

which would not contain non-standard usage like -2 state !, and it would also work in interpret state (not the most important feature, but a feature nonetheless). And ]] could also be implemented as a standard program.

I don't want to restrict the usage of rectypes/translators to state-dependent outer interpreters. Other uses may be rare, but they exist, and people may come up with more over time if we make the interface flexible enough. The proposal does not propose to standardize state-independent ways to get at the functionality. Therefore, if the proposal is accepted, they don't exist for standard programs, and therefore they are not counterarguments against the disadvantages of the proposed state-dependent-only translators. The fact that this state-dependence means that you cannot use rectypes/translators to build other rectypes/translators is another (minor) argument against the state-dependence.

Concerning having a state-independent rectype as an abstract data type, the first version of this proposal proposed that rectype is an executable word with stack effect ( i*x state -- j*x ) where state would be 0, -1, or -2. This does not expose anything about the internals, and even allows to define rectypes without using a special defining word. The invocation in the text interpreter is ( i*x rectype ) state @ swap execute, and in postpone it´s as shown above.

Alternatively, if the rectype is the address of some data structure, yes, we would need an additional word, maybe rectype-translate ( i*x rectype n -- j*x ) that performs the access to the data structure. The usage in the text interpreter would be ( i*x rectype ) state @ rectype-translate and the usage in postpone would be ( i*x rectype ) -2 rectype-translate.

ruvavatar of ruv

Bernd writes:

The most obvious difference is that with translator-execute, you need another word.

Yes, essentially I agree concerning translator-execute and execute alternatives.

Yet another difference is that with translator-execute the Forth text interpreter (the outer loop) should know this additional word (probably it means more degree of coupling). But with execute — it should not know any additional word.

to make word list ids executable [...] but it is clear that they can't be normal colon definitions

Another example is defer-words (words created by defer), which are executable but are not normal colon definitions — defer! and defer@ can be applied to their xt.

The following implementation should be standard, too

The provided implementation for ]] is system dependent, namely it depends on implementation of Recognizers API. But, anyway, Gforth's ]] can be implemented in a standard way via postpone.

BerndPaysanavatar of BerndPaysan

A translator is the address of a data structure, which also happens to be executable. This is not a contradiction! And there was a proposed standard way to access fields directly, renamed from the Trute proposal (but with otherwise identical, value-field like semantics) to INTERPRET-TRANSLATOR, COMPILE-TRANSLATOR, and POSTPONE-TRANSLATOR. The reason I deleted these is that we don't even use them in Gforth, we only use >POSTPONE, which has a different effect (it does not read out the xts, it executes it right away). If there is consensus that this is the right interface (not a value-field, but a defer-field), I can add this back to the proposal; as well as adding a standard way to set the state without knowing the internals of the system, for which the file recognizer-ext.fs in Gforth also provides a suggestion:

: translate-state ( translator-access-xt -- )
    \ takes a translator access xt, and may check if that actually is one
    >body @ cell/ negate state ! ;

The hypothetical more performant implementation in Reply 1043 would have a different translate-state, which would contain something like

>body @ ['] do-translate >body cell+ !

and only change STATE for interpret/compile.

This proposal is minimalistic on purpose and does not cover all corner cases, especially not those where no consensus has been reached yet.

I consider the magic number dispatch method proposed earlier as not appropriate: this is tied to a specific implementation, and not a good interface. Method invocation or field access should be done by named access words, not by numbers.

ruvavatar of ruv

Anton writes:

postpone can be implemented like this

postpone can be implemented in any variant of the Recognizer API, with more or less code.

A difference is whether the behavior of postpone can be extended/changed without redefinition of postpone.

My point: if users need to extended behavior of postpone without redefinition, then a special method can be specified for that. OTOH, postpone (and ]]) is a poor man's "postponing mode". An example of a more convenient tool is my c-state PoC, which provides a better tool for users, and it even supports any new user-defined special words.

I don't want to restrict the usage of rectypes/translators to state-dependent outer interpreters.

It's not an argument, since the API can provide words like compile-token, execute-token, postpone-token, having ( i*x xt.translator -- j*x ) or ``( ix rectype -- jx )`, which are state-independent and don't restrict usage in the mentioned way.

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
  • 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
  • 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
  • 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
  • 2023-09-15 Add list of example recognizers and their names.

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone in this stage, use -48 throw, otherwise define a postpone action:

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a translator.

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE-independently, which only works on translators created by TRANSLATE: (e.g. for implementing POSTPONE), so any other way to define a translator is non-standard.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

RECOGNIZER-SEQUENCE: ( n*xt n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with the topmost xt on stack and proceeding towards the bottommost xt until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

SET-RECOGNIZER-SEQUENCE ( n*xt n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

GET-RECOGNIZER-SEQUENCE ( xt-seq -- n*xt n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

Obtain the recognizer sequence xt-seq as n*xt n.

TANSLATE-NT ( jx nt -- kx ) RECOGNIZER EXT

TANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token.

TRANSLATE-NUM ( jx x -- kx ) RECOGNIZER EXT

TRANSLATE-NUM ( n -- n | ) RECOGNIZER EXT

Translates a number.

TRANSLATE-DNUM ( d -- d | ) RECOGNIZER EXT

Translates a double number.

TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT

Translates a floating point number.

TRANSLATE-STRING ( addr u -- addr u | ) RECOGNIZER EXT

Translates a string.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Recognizer examples

REC-NT ( addr u -- nt translate-nt | notfound ) Search the locals wordlist if locals have been defined, and then the search order for a definition matching the string addr u, and provide that name token as result.

REC-NUM ( addr u -- n translate-num | d translate-dnum | notfound ) Try converting addr u into a number, and on success return either a single number n and translate-num, or a double number d and translate-dnum.

REC-FLOAT ( addr u -- r translate-float | notfound ) Try converting addr u into a floating point number, and on success return that number r and translate-float.

REC-STRING ( addr u "string"<"> -- addrs us translate-string | notfound "string"<"> ) Convert quoted strings (i.e. addr u starts with '"') in the input stream into string literals, performing the same escape handling as S\" and on success return the converted string as addrs us and translate-string.

REC-TICK ( addr u -- xt translate-num | notfound ) If addr u starts with a ````` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.

REC-SCOPE ( addr u -- nt translate-nt | notfound ) Search for words in specified vocabularies (the vocabulary needs to be found in the current search order), the string addr u has the form vocabulary:name, otherwise than that this specifies the vocabulary to be searched in, REC-SCOPE is identical in effect to REC-NT.

REC-TO ( addr u -- xt n translate-to | notfound ) Handle the following syntax of TO-like operations of value-like words: * ->value as TO value or IS value * +>value as +TO value * '>value as ADDR value * @>value as ACTION-OF value xt is the execution token of the value found, n indexes which variant of a TO-like operation is meant, and translate-to is the corresponding translator.

REC-ENV ( addr u -- addrs us translate-env | notfound ) Takes a pattern in the form of ${name} and provides the name as addrs us on the stack. The corresponding translator translate-env is responsible for looking up that name in the operating system's environment variable array.

REC-COMPLEX ( addr u -- rr ri translate-complex | notfound ) Converts a pair of floating point numbers in the form of float1+float2i into a complex number on the stack, and returns translate-complex on success.

Testing

TBD

BerndPaysanavatar of BerndPaysanNew Version: minimalistic core API for recognizers

Hide differences

Author:

Bernd Paysan

Change Log:

  • 2020-09-06 initial version
  • 2020-09-08 taking ruv's approach and vocabulary at translators
  • 2020-09-08 replace the remaining rectypes with translators
  • 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
  • 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
  • 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
  • 2022-09-10 More complete reference implementation
  • 2022-09-10 Add use of extended words in reference implementation
  • 2022-09-10 Typo fixed
  • 2022-09-12 Fix for search order reference implementation
  • 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
  • 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
  • 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
  • 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
  • 2023-09-15 Add list of example recognizers and their names.

Problem:

The current recognizer proposal has received a number of critics. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows to implement more fancy stuff as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal, this proposal is therefore not a full proposal, but sketches down some changes to the original proposal.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

  • Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
  • Make the recognizer sequence executable with the same effect as a recognizer
  • Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone in this stage, use -48 throw, otherwise define a postpone action:

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a translator.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE-independently, which only works on translators created by TRANSLATE: (e.g. for implementing POSTPONE), so any other way to define a translator is non-standard.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( n*xt n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with the topmost xt on stack and proceeding towards the bottommost xt until successful.

SET-RECOGNIZER-SEQUENCE ( n*xt n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- n*xt n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as n*xt n.

TANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token.

TRANSLATE-NUM ( n -- n | ) RECOGNIZER EXT

Translates a number.

TRANSLATE-DNUM ( d -- d | ) RECOGNIZER EXT

Translates a double number.

TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT

Translates a floating point number.

TRANSLATE-STRING ( addr u -- addr u | ) RECOGNIZER EXT

Translates a string.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Recognizer examples

REC-NT ( addr u -- nt translate-nt | notfound ) Search the locals wordlist if locals have been defined, and then the search order for a definition matching the string addr u, and provide that name token as result.

REC-NUM ( addr u -- n translate-num | d translate-dnum | notfound ) Try converting addr u into a number, and on success return either a single number n and translate-num, or a double number d and translate-dnum.

REC-FLOAT ( addr u -- r translate-float | notfound ) Try converting addr u into a floating point number, and on success return that number r and translate-float.

REC-STRING ( addr u "string"<"> -- addrs us translate-string | notfound "string"<"> ) Convert quoted strings (i.e. addr u starts with '"') in the input stream into string literals, performing the same escape handling as S\" and on success return the converted string as addrs us and translate-string.

REC-TICK ( addr u -- xt translate-num | notfound ) If addr u starts with a ````` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.

REC-SCOPE ( addr u -- nt translate-nt | notfound ) Search for words in specified vocabularies (the vocabulary needs to be found in the current search order), the string addr u has the form vocabulary:name, otherwise than that this specifies the vocabulary to be searched in, REC-SCOPE is identical in effect to REC-NT.

REC-TO ( addr u -- xt n translate-to | notfound ) Handle the following syntax of TO-like operations of value-like words:

* ``->``*value* as ``TO ``*value* or ``IS ``*value*
* ``+>``*value* as ``+TO ``*value*
* ``'>``*value* as ``ADDR ``*value*
* ``@>``*value* as ``ACTION-OF ``*value*
  • ->value as TO value or IS value
  • +>value as +TO value
  • '>value as ADDR value
  • @>value as ACTION-OF value

xt is the execution token of the value found, n indexes which variant of a TO-like operation is meant, and translate-to is the corresponding translator.

REC-ENV ( addr u -- addrs us translate-env | notfound ) Takes a pattern in the form of ${name} and provides the name as addrs us on the stack. The corresponding translator translate-env is responsible for looking up that name in the operating system's environment variable array.

REC-COMPLEX ( addr u -- rr ri translate-complex | notfound ) Converts a pair of floating point numbers in the form of float1+float2i into a complex number on the stack, and returns translate-complex on success.

Testing

TBD

BerndPaysanavatar of BerndPaysan

Things to discuss, because there are still too many variables.

ToDo:

  • Rename Recognizers from REC-result to RECOGNIZE-result. A solution for .RECOGNIZERS drowning the reader in recognize- could be to skip that prefix, because all recognizers are supposed to have the same prefix, anyways.
  • Revert the name of translators to rectypes or some similar word showing that this does describe a type?
  • Add mode/state-specific access words to the translators again and decide on how they work. I prefer defer-field likes, which right away execute the corresponding action, and not put an xt on the stack for consumption. Defer-fields could work together with IS and ACTION-OF to access the xts within (in Gforth, they do).

Answers to some questions:

A lot of thoughts went into it to make different subsets of this proposal useful on their own, and allow different implementation strategies. The answer to “can I do without feature X” is most likely yes. You can use the subset of the features you want. Stripping away too much results in a subset no longer usable.

  • Opening up the whole idea to small systems is useful to gain wider use.
  • FORTH-RECOGNIZE is a deferred word in the reference implementation on purpose, and that allows changing it without adding more words. To add more implementation options, you can use the setter and getter words (which are optional) if you don't want to implement it as deferred word to swap in and out named sequences.
  • The recognizer sequences do have words to get and set the sequence, so you can just work with a single sequence and set/get it if you like. The nesting capability comes by the magical fact that a recognizer sequence has the same stack effect as a recognizer.
  • You can do without both, because recognizer sequences can be written as colon definitions “by foot”.
  • Named sequences are useful, especially when you swap in recognizer sequences for applications that do something completely different than the Forth recognizer sequence. If you do not want to support named sequences, you can still provide the one single named sequence FORTH-RECOGNIZE, and allow SET-RECOGNIZER-SEQUENCE and GET-RECOGNIZER-SEQUENCE to operate just on that. That's also an option where recognizers are useful without having FORTH-RECOGNIZE being deferred and no RECOGNIZER-SEQUENCE:.
  • The NOTFOUND return for failure is there so that you can always EXECUTE the result of FORTH-RECOGNIZE and don't have to check for errors there.

Tough question: The string recognizer has a side effect, which is not good. Moving that side effect to the translator is causing other problems, because TRANSLATE-STRING no longer has the corresponding string on the stack, but needs parsing it later. Actually, parsing should happen in PARSE-NAME. It still seems to be a hack that doesn't have a perfect solution.

ruvavatar of ruv

Rename Recognizers from REC-result to RECOGNIZE-result

In general, an abbreviation or acronym may be acceptable to me. But in this case I prefer RECOGNIZE- rather than REC-. The main disadvantage of rec if that it has misleading associations. And the main advantage of recognize is that it's a whole English word that is very appropriate for our case.

The part referred as "result" should not be a result (of recognizing), but the expected type of the input lexeme. Have a look in your examples — REC-NUM and REC-TICK produce the same result type translate-num, but they accept different types of input lexemes, and these types are identified by NUM and TICK symbols correspondingly.

Thus, the naming form for recognizers can be expressed as RECOGNIZE-{lexeme-type-symbol}.


Revert the name of translators to rectypes or some similar word showing that this does describe a type?

It does describe a type of what? It describes a type of a token i*x, which is a result of recognizing. Actually, a token translator identifies the type of a token i*x, which is a result of recognizing. Then, a token translator is a token type in the same time.

If we want to reflect this idea, we can use the acronym tt, which stands for both: token translator and token type. Then, token translators can be named according to the form TT-{token-type-symbol}. It looks elegant to me.

The names of translators are used for two purposes: to call a translator (for example, when we define a new translator via existing translators), and to obtain xt of a translator (which is an identifier for a token type in the same time) — to analyze a result of recognizing. The prefix tt- looks good in these both case.

ruvavatar of ruv

An example of use translators for two different purposes:

\ use "tt-lit" and "tt-2lit" just to call these token translators:

: tt-3lit ( 3*x -- 3*x | )
  >r tt-2lit  r> tt-lit
;

: recognize-forth-lexeme ( sd -- i*x tt ) forth-recognizer execute ;


\ use "tt-xt" to analyze a token type:

: recognize-tick ( sd -- xt tt.xt | 0 )
  "'" match-head 0= if 2drop 0 exit then  ( sd2 ) \ the input lexeme without the leading tick
  ['] recognize-forth-lexeme execute-balance2 ( i*x tt|0 n.data-stack n.float-stack )
  2>r dup ['] tt-xt = if 2rdrop exit then drop 2r> fndrop ndrop 0
;

In this implementation for recognize-tick (not tested), the phrase 'foo::bar::baz will work correctly and returns xt of the word baz in the wordlist bar in the wordlist foo, when recognize-pqname for the syntax "::" (example) is a part of forth-recognizer.

To implement this, we do a nesting call of the forth recognizer for another lexeme and then analyze the returned type. If the returned type is not appropriate, we drop the token (from the data stack, and from the floating-point stack, if any). So we need to be sure that calling recognize-forth-lexeme never causes any side effect (other than stacks), even when recognizing succeeds.

NB: when recognize-tick is a part of the current forth-recognizer, executing of recognize-forth-lexeme on some inputs will produce indirectly recursive call of recognize-forth-lexeme (as intended).

ruvavatar of ruv

Tough question: The string recognizer has a side effect, which is not good. Moving that side effect to the translator is causing other problems, because TRANSLATE-STRING no longer has the corresponding string on the stack, but needs parsing it later.

  1. It's pretty allowed for a translator to parse the input buffer and/or read the input stream. Some token translators will even do nesting calls of the Forth text interpreter and can throw exceptions.

  2. A problem that a part of the string can be in the input buffer (or even in the input stream) is solved via introducing two translators for strings: one accepts the full string from the stack (e.g. tt-slit), and another (e.g. tt-slit-parsing) accepts the starting part from the stack, and the tail from the input buffer (or input stream). The string recognizer returns one or another depending whether a lexeme is a completed string, or the start of the string only.

I published a reference implementation in 2019, and now updated it for the current proposal.

A string recognizer can be as follows:

: quot ( -- sd.quot ) s\" \"" ;

: recognize-string ( sd.lexeme -- sd tt.slit|tt.slit-parsing | 0 )
  quot match-head 0= if 2drop 0 exit then quot match-tail if ['] tt-slit exit then
  2dup quot contains if 2drop 0 exit then \ fail if '"' is found in the middle of the string
  ['] tt-slit-parsing
;

BerndPaysanavatar of BerndPaysan

The code which I have simply looks like this:

['] translate-string of  json-string!           endof

Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possibles results (like ints, floats or such) are fully converted into something useful at this stage.

Thinking a bit more about that, I found:

  1. The thing you want to nest is the translator for names
  2. As I said, names should be first, numbers second and the rest third
  3. We have nestable recognizer sequences

So one solution would be to put all recognizers that return nts+translate-nt (or variants of that, e.g. locals have a variant of translate-nt that differs for postpone) in one recognizer stack, which has a name, and can be called without calling the entire recognizer stack. These recognizers have now a predictable effect, and no side effect. Since you can't tick locals, you still have to check for translate-nt, but that's ok. You don't have to go through all weird other recognizers.

In Gforth, .recognizers now can handle and display nested recognizers, and if you split this up like that, it would output:

.recognizers  ~names ( ~nt ( Forth Forth Root ) ~scope ) ~numbers ( ~num ~float ) ~others ( ~string ~to ~dtick ~tick ~body ~complex ~env ~meta )

The ~ is there to abbreviate recognize- (or rec- now).

This also makes it easier to add recognizers where they belong, e.g. when you add the scope recognizer, you just push them to the end of the the names recognizer stack. If you add the floating point recognizer, the complex recognizer (both are numbers), or the hex floating point recognizer for exact notation of floating point constants, you just push them to the back of the numbers recognizer stack, and they get ahead of the others. I like this solution.

The other solution is what Gforth does: There's a ?REC-NT which does the nesting, the checking for translate-nt, and the cleaning up of the side effects (stacks and >IN). There is the possibility to make this more generic, e.g. create a word TRY-RECOGNIZE which gets an xt, passes that to the result, and if that returns false, everything is cleaned up and false is returned, otherwise whatever that xt left (including the flag) is returned.

The cleaning up is already cumbersome, because a variable number of values can be returned on both data and floating point stack, and when in addition to that also >IN can change, it's just a little bit more hustle.

ruvavatar of ruv

One correction. I wrote:

If we want to reflect this idea, we can use the acronym tt, which stands for both: token translator and token type.

It should be read as:

If we want to reflect this idea, we can use the acronym tt, which stands for both: "translate token" (verb) and "token type" (noun).

Data type symbol

To specify formal requirements, we have to introduce a new data type for token translators, which is a subtype of xt. And the abbreviation tt is a good candidate for this data type symbol.

If we will have the data type tt => xt|0, and the symbol sd for the string data type, the naming convention along with the stack diagram for a recognizer can be expressed as:

RECOGNIZE-{lexeme-type-symbol} ( sd.lexeme -- i*x tt ) ( F: -- j*r )

ruvavatar of ruv

@BerndPaysan writes:

Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possibles results (like ints, floats or such) are fully converted into something useful at this stage.

This would be a valid argument if it were possible to return something useful from a recognizer in all use cases but single-line string literals. But it's impossible.

For example, a recognizer for multi-line string literals cannot parse the full string literal without refilling the source (see my PoC implementation). Should we also restore the input source state to isolate side effects of recognizers?

And it still isn't enough. A recognizer for curly-based markup like foo{ any forth code bar{ nested code }bar ... }foo cannot return something useful, since a useful thing in this case is a created definition or just a side effect of appending some semantics to the current definition. Should we also restore the state of the dictionary?

I think, it's obvious — isolation of all possible side effects of recognizers is not fruitful.

Yes, some recognizers returns objects that are not useful by themself, but they still return information what a given lexeme means, and it's an acceptable price for absent side effects for all recognizers.

Also we separate concerns into things that do have side effects (token translators) and things that don't have side effects (recognizers). It's very useful separation.

ruvavatar of ruv

@BerndPaysan writes:

The code which I have simply looks like this:

['] translate-string of  json-string!           endof

A straightforward solution is to handle each token type of string literals separately. Probably, I would write it as follows:

  'tt-slit           of                  json-string!    endof
  'tt-slit-parsing   of  parse-slit-end  json-string!    endof
  'tt-slit-ml        of  parse-slit-ml   json-string!    endof

(I would use a recognizer for a leading tick, and naming of translators in the form tt-{token-type-symbol})

Or I would factor a helper word as follows:

: ?prepare-tt-slit ( i*x tt -- i*x tt | sd.transient tt.slit )
  case
    'tt-slit           of                  'tt-slit endof
    'tt-slit-parsing   of  parse-slit-end  'tt-slit endof
    'tt-slit-ml        of  parse-slit-ml   'tt-slit endof
  endcase
;

: eval-json ( .. tag -- )
  ?prepare-tt-slit case
    ...
    'tt-slit           of                  json-string!    endof
   ...
  endcase
;

ruvavatar of ruv

Multiple entry points for the Forth recognizer

@BerndPaysan writes:

This also makes it easier to add recognizers where they belong, e.g. when you add the scope recognizer, you just push them to the end of the the names recognizer stack. If you add the floating point recognizer, the complex recognizer (both are numbers), or the hex floating point recognizer for exact notation of floating point constants, you just push them to the back of the numbers recognizer stack, and they get ahead of the others. I like this solution.

Yes, I also consider such a solution. It's a convenient solution to implement the default Forth recognizer.

But requiring the Forth recognizer to always conform this particular structure of recognizer sequences, and even always be the same instance of this structure, is too restrictive.

And otherwise you don't know the id of the actual names recognizer sequence (and even don't know whether such a sequence exists), and so you cannot check a lexeme against only this sequence (I mean, in implementation of recognize-tick).

Filter recognizer results

Bernd, your word try-recognize is a good factor to filter results, regardless side effects (beyond stacks). Having recognizers without side effects, it can be also implemented in a portable way.

If this word filters for a single token type, it's better to pass a corresponding tt directly (instead of xt.filter).

If this word allows to filter for multiple token types (I assume this variant), it should not drop tt from the stack.

Also, to be more useful, this word should not be bound to the current Forth recognizer only. Then, this word can be called as

apply-recognizer-filter ( sd.lexeme xt.recognizer xt.filter -- i*x tt | 0 )`.

A usage example:

: recognize-forth-name ( sd.lexeme -- nt tt.nt | 0 )
  forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter
;

: find-forth-name ( sd.lexeme -- nt | 0 )
  forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter  if exit then 0
;

: find-forth-name? ( sd.lexeme -- nt true | false )
  forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter  0<>
;

: recognize-tick ( sd.lexeme -- xt tt.xt | 0 )
  "'" match-head 0= if 2drop 0 exit then  ( sd2 ) \ the input lexeme without the leading tick
  forth-recognizer [: dup 'tt-xt = ;] apply-recognizer-filter
;

The cleaning up is already cumbersome, because a variable number of values can be returned on both data and floating point stack, and when in addition to that also >IN can change, it's just a little bit more hustle.

Yes, but, as I show, >in is not enough. Also, it's better to avoid such special cases in general.

ruvavatar of ruv

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

There is no way for a program to check whether it can apply TO to FORTH-RECOGNIZER, or FORTH-RECOGNIZE, or RECOGNIZE-FORTH-LEXEME, etc. Thus, TO cannot be optional. And it cannot be mandatory too. Thus, TO cannot be a part of the API at all — neither RECOGNIZER, nor RECOGNIZER EXT.

Then the getter and setter should be a mandatory part of the API.

ruvavatar of ruv

In continuation to the message:

Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possibles results (like ints, floats or such) are fully converted into something useful at this stage.

This would be a valid argument if it were possible to return something useful from a recognizer in all use cases

Another example of an unuseful token is the result of the mentioned recognizer REC-TO, which recognizes a syntax like ->foo.

It's too restrictive to require this token be ( xt n tt ), since in some systems it can be just ( xt.set-value tt ), in other — ( addr.data-field xt.store tt ). This means that this token is not something useful to a program at all (apart of translation).

GeraldWodniavatar of GeraldWodni

The committee thanks the authors for all the work. Here is the timetable:

  • Everybody interested in this proposal: please submit your comments by end of October.
  • Bernd (main author): please work this into a new version by the end of the year (2024).
  • The committee will have a special interim meeting for this very proposal in February (final date will be announced in mattermost)
Considered

BerndPaysanavatar of BerndPaysan

Concerning the setters and getters: I would prefer to make it mandatory that FORTH-RECOGNIZE actually is a deferred word, and drop the additional getters and setters completely. DEFER, IS, and ACTION-OF are all CORE EXT; so if you implement the recognizers, you have a dependency on those. The previous proposals had VALUE and TO and interface, which is also CORE EXT.

Gforth could support IS and ACTION-OF on recognizer sequences, too (i.e. assign n elements in order), through its polymorphous approach at all those words for value-style words (TO, +TO, ADDR, IS, ACTION-OF all can do different things on different classes of values), but I guess that would be too much.

Can those setters and getters be optional in case you don't want to support DEFER, and how can a program be written to work in both cases? If you have TOOLS EXT available, you can use

[DEFINED] is [IF]
    is forth-recognize
[ELSE]
    [DEFINED] to [DEFINED] forth-recognizer and [IF]
        to forth-recognizer
    [ELSE]
        set-forth-recognizer
    [THEN]
[THEN]

Yes, this is ugly and shows that having different options is not a good idea.

For the reworked proposal, I will need to restructure the proposal in a way that optional parts I rather want to remove are outlined as such, so that the final rewrite is easy.

ruvavatar of ruv

Deferred words in API considered harmful

make it mandatory that FORTH-RECOGNIZE actually is a deferred word

As we have discussed, the main problem with a deferred word is that it can't be redefined by wrappers that have additional actions when setting or getting the value. In this respect, such a word in an API is as bad as an address-flavoured variable (like BASE).

There is also a recent discussion in comp.lang.forth (link) under subjects "value-flavoured approach" and "value-flavoured structures".

Special data object on failure considered harmful

A question is what to return on failure (unsuccess): a special data object (xt of notfound) or a common data object 0 (zero).

Below is a copy of my rationale from 2023, with some rewording.

There are two strong arguments against a special data object:

  • consistency with other similar words;
  • impact on the overall lexical size of programs.

Consistency

Many standard words returns some data object on success, or 0 (zero) on unsuccess/failure. This is possible because this data object cannot be 0.

For example:

  • name>interpret ( nt -- xt | 0 )
  • find-name ( sd.name -- nt | 0 )
  • find-name-in ( sd.name wid -- nt | 0 )
  • find ( c-addr -- xt n | c-addr 0 )
  • search-wordlist ( sd.name -- xt n | 0 )
  • source-id ( -- fileid | -1 | 0 ) — not a fail, but also an example when zero was chosen instead of a special object.

Also, it is a common approach in practice. This allows common high-order functions operates on the common failure result 0.

Why should not recognizers follow this practice? Why should they return a special id on failure rather than zero?

Lexical code size

Returning notfound on failure makes the code shorter (in terms of lexemes) in some places. But the point is that it makes code longer in more places.

I checked the source codes in Gforth (as of 2023-09-17), which include both the implementation and usage of a Recognizer API. In its code:

  • ['] notfound with = or <> is used 10 times, and without checking — 32 times.
  • forth-recognize execute is used 3 times.

If we use 0 (zero) instead of the notfound xt, then:

  • ['] notfound <> is removed 5 times, which eliminates 15 lexemes;
  • ['] notfound = is replaced with 0<> 5 times, which eliminates 10 lexemes;
  • ['] notfound is replaced with 0 32 times, which eliminates 32 lexemes;
  • the definition for notfound is removed, a definition for ?found is added: : ?found ( x.some\0 -- x.some | 0 -- never ) dup 0= -13 and throw ;, which adds not more than +3 lexemes;
  • forth-recognize execute is replaced with forth-recognize ?found execute 3 times, which adds +3 lexemes;
  • the word ?found can be also used after find, search-wordlist, find-name, find-name-in — when the user needs to execute their result at once, and unsuccess should produce an exception.

Thus, replacing of notfound by zero reduces the overall lexical code size in Gforth by more than 51 lexemes, which is more than 0.4KiB in absolute size (as on 2023-09-17).

So why should we prefer an approach that increases the overall lexical size of programs?

AntonErtlavatar of AntonErtl

About the proposal text

The "Problem" section does not describe a problem of Forth-2012 that the proposal wants to solve, but considers a problem with some other recognizer proposal. Similarly, the "Solution" section refers to some other recognizer proposal. This makes these sections useless for readers who have not first read up on the other proposal, which is not even linked here. Parts of the "Solution" section might be useful in another section on transitioning from the earlier proposal.

Instead, the "Problem" and "Solution" sections should describe what benefits this proposal adds to the standard, and how. A possible "Discussion" section and its subsections should describe the benefits of the present approach over possible alternative approaches (if that's too detailed, lazy system implementors will complain about the length of the proposal, but some complaints should just be ignored).

"Typical use" should of course be presented.

State-dependence

The proposal in its present form is unacceptable to me because it defines a defining word TRANSLATE: for state-dependent words, and expects recognizers to produce the xt of state-dependent words. This makes the translators hard to use anywhere except in INTERPRET; the proposed-for-standard interface is even hard (actually impossible with standard means) to use in POSTPONE, which is an intended user of translators, as the proposal admits itself:

POSTPONE can do that without a standardized way

Another problem with the state-dependent translators is that it leads to either handwaving specifications of what they do, as evidenced in XY.3.1:

TRANSLATE-THING ( jx ix -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.

in the non-specification of what translator-xt does in FORTH-RECOGNIZE and the handwaving specification of "name:" in TRANSLATE:

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

and the nonspecification of what TRANSLATE-NT, TRANSLATE-NUM, TRANSLATE-DNUM, TRANSLATE-FLOAT, and TRANSLATE-STRING do.

Or if you specify exactly what happens, it leads to lengthy texts that explain the state-dependence, and the three different cases. And you cannot even specify when xt-post is performed, because there is no "postpone state" in the standard. On the contrary the current document specifies that STATE is either 0 (interpretation state) or non-zero (compilation state), without any values left for a postpone state, and specifies only words for getting into interpretation state and compilation state, not postpone state.

If you really believe that the state-dependent approach is a good idea, please specify all these words exactly; the editor won't do it for you.

Opaque solution

If there is no need to make POSTPONE implementable in a standardized way, there is no need to make INTERPRET (which is not even standardized) implementable in a standardized way, either, and the translators can become a completely opaque thing that the standard does not document. In that case there is also no need for the translators to actually be executable. The recognizer could return an opaque translation token, and standard programs can only use that for implementing recognizers, but not for implementing text interpreters, POSTPONE, or anything else.

Transparent solutions

Alternatively, we might heed "Don't bury your tools!" and have a more useful interface for translators, like what we have seen in earlier drafts and other recognizer proposals.

POSTPONE

If the idea of the proposal is that xt-post is actually used by POSTPONE, the proposal should specify the change to POSTPONE.

Standardize recognizers

I expect that more people will want to compose existing recognizers into recognizer sequences than to define new recognizers, but they usually need to know about existing recognizers in order to do that. Therefore the proposal (or an accompanying proposal) should not just propose standard translators, but also standard recognizers.

ruvavatar of ruv

@AntonErtl writes:

The proposal in its present form is unacceptable to me because it defines a defining word TRANSLATE: for state-dependent words, and expects recognizers to produce the xt of state-dependent words.

I do not like TRANSLATE: either, but for a different reason. Sometimes it is very convenient to define a translator as a quotation (right inside the recognizer), and if you are forced to define a translator only with TRANSLATE:, you cannot define it as a quotation.

This makes the translators hard to use anywhere except in INTERPRET;

Could you provide some examples, please? It seems, this is not harder than performing the observable interpretation semantics using the result of name>interpret.

BerndPaysanavatar of BerndPaysan

Concerning explicit access methods to xt-int/xt-comp/xt-post, I can offer the following compromise, as a result of observations made:

It turns out that you can not access xt-int and xt-comp by setting STATE, executing the translator, and then reverting STATE to the value before, because words can change STATE as part of their interpretation or compilation semantics, and in that case, the state change is a desired result of performing interpretation or compilation semantics.

However, it turns out that you can access xt-post that way, because the only word that possibly changes that state is [[, and that token is a) no visible at all to POSTPONE, and b) changes the state back to compilation state, the state POSTPONE was in anyhow.

So if your system allows full explicit access to all three possible states, all translators have to be defined by 'TRANSLATE:', and I can offer you three access methods. If you only want to implement POSTPONE, the following definition actually works:

: postpone ( "string" -- )
  parse-name forth-recognize ?found
  state @ >r -2 state ! execute r> state ! ; immediate

Further observations:

Gforth has >INTERPRET and >COMPILE, and doesn't use them, only >POSTPONE is used. In exactly one place, in POSTPONE. All other invocations are through EXECUTE or only taking the data. The rest is implementation, including the extension towards more of those access methods for more, user-defined states. The question is whether you need to standardize a tool that has no use case, even if you don't bury it.

A possible way to deal with this is to move this out to a separate proposal.

What has been quite useful is the EXECUTE interface for user-written interpreters, because these are interpret-only, and don't need the complication of state-dependent translators at all.

BerndPaysanavatar of BerndPaysan

Sleeping over it added a few ideas:

The invocation through changing STATE and restoring it works (in general) for translators that will definitely not change STATE as part of their own operation, e.g. translators for literals. It also works (as a special case) for POSTPONE, so a standard implementation of POSTPONE using that method is possible. The postpone mode itself, which needs to change STATE at [[ relies on the dispatch through STATE without setting and restoring STATE around the invocation, so it also works.

The question here is not if that implementation is a quality implementation, but whether it's not so bad that it is another bag full of inconsistencies. IMHO, TRANSLATE-NT will have demonstrable inconsistencies when not using the clean TRANSLATE: interface, but combined literal translators won't. For the cleaner interface outside of POSTPONE itself (which is special case enough to not require the cleaner interface), we have to demonstrate that there is an actual use case. So far, we don't have one.

Both POSTPONE with the additional functionality and the postpone mode ]][[ will become part of the proposal.

ruvavatar of ruv

@BerndPaysan writes:

It turns out that you can not access xt-int and xt-comp by setting STATE, executing the translator, and then reverting STATE to the value before, because words can change STATE as part of their interpretation or compilation semantics, and in that case, the state change is a desired result of performing interpretation or compilation semantics.

This is wrong. Yes, the state change can be a desired result of interpretation or compilation semantics, but this does not prevent us from performing the interpretation or compilation semantics regardless the initial value of STATE, as I shown many times.

We can use the following helpers for that.

\ Useful factors
: compilation ( -- flag )  state @ 0<> ;
: enter-compilation ( comp: false -- true  |  comp: true -- true )  ] ;
: leave-compilation ( comp: true -- false  |  comp: false -- false )  postpone [ ;

\ For the execution semantics identified by xt,
\ perform the part that can be observed in interpreted state.
: execute-interpreting ( i*x xt -- j*x )
  compilation 0= if execute exit then
  leave-compilation execute enter-compilation
;

\ For the execution semantics identified by xt,
\ perform the part that can be observed in compilation state.
: execute-compiling ( i*x xt -- j*x )
  compilation if execute exit then
  enter-compilation execute leave-compilation
;

If we have a result of recognizing with the xt of a translator at the top (i.e., a fully qualified token), and we want to perform the corresponding interpretation semantics regardless of the current value of STATE, we should execute this xt with execute-interpret. If we want to perform the corresponding compilation semantics regardless of the current value of STATE, we should execute it with execute-compiling. If we want to perform the semantics according to STATE, we should just execute this xt with execute.

The key point in the implementation of execute-interpreting and execute-compiling is that we do not save/restore STATE if it matches the semantics we want to perform — and if changing STATE is part of the semantics, STATE will be changed. On the other hand, if STATE does not match the semantics we want to perform, we change STATE and then restore it — if changing STATE is part of the semantics, then it will change STATE to the same value that was saved and to one we restore it to. Thus, the resulting STATE will be as expected!

NB: execute-interpreting and execute-compiling are also required if we want to perform the interpretation semantics or compilation semantics from an nt, regardless the current value of STATE. Moreover, these words are required even in the old approach for Recognizer API, which provides the words RECTYPE>INT and RECTYPE>COMP — because these words have the same flaw for state-dependent words as NAME>INTERPRET and NAME>COMPILE.

ruvavatar of ruv

@AntonErtl writes about token translators:

if you specify exactly what happens, it leads to lengthy texts that explain the state-dependence, and the three different cases. And you cannot even specify when xt-post is performed, because there is no "postpone state" in the standard. On the contrary the current document specifies that STATE is either 0 (interpretation state) or non-zero (compilation state), without any values left for a postpone state, and specifies only words for getting into interpretation state and compilation state, not postpone state.

This is reasonable. And we also discussed in the Recognizer chat group that the standard does not imply such a state as postponing (for the Forth text interpreter).

In my opinion, these problems can be avoided.

  1. We should specify "to translate a token" and "token translator" in the common sections of term definitions, data types and usage requirements. Then, we do not need to repeat that for every token translator. It will be enough to specify that a word is a token translator, and the data type of the token (that it translates).

  2. We can have a word like postpone-token ( qt -- ) that append the compilation semantics of a lexeme, which was recognized as qt, to the current definition. (qt is a qualified token, which is a pair of an unqualified token and token translator ( uq tt ))

So, any additional state, if any, is encapsulated into postpone-token. The standard should not specify it.

Thus, postpone can be defined like this (in my parlance):

: postpone ( "name" -- )
  parse-lexeme perceive ?found postpone-token
;

How postpone-token finds/performs the postponing action from tt — it's an internal problem of implementation. The word postpone-token should throw the exception -32 "invalid name argument" if a postponing action is not associated with tt.

We need to provide a way to associate a postponing action (an xt) with a tt, or to create a new tt from an xt and tt. The postponing action should be optional. The user needs to provide a postponing action only if they want to make postpone applicable to the corresponding lexemes.

For example, we can have an optional word postponable ( tt1 xt.postponing -- tt2 ). Probably, this word shall return the same tt2 for the same input pair ( tt1 xt.postponing ). This word is optional, because it can be implemented along with postpone-token in a standard program, and postpone can be redefined to use then.

Reply New Version