Proposal: minimalistic core API for recognizers

Considered

BerndPaysan [160] minimalistic core API for recognizers Proposal 2020-09-06 09:40:07

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to define a very minimalistic API for a core recognizer and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original one.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned data type id is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- rectype )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] rectype-null  THEN ;

then be told that this is not the right way, even though it looks like it is working.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes a string and returns a rectype+additional data on the stack (no additional data for RECTYPE-NULL):

REC-SOMETYPE ( addr len -- i*x RECTYPE-SOMETYPE | RECTYPE-NULL )

XY.3 Additional usage requirements

XY.3.1 Data type id

rectype: subtype of xt, and executes with the following stack effect:

RECTYPE-SOMETYPE ( i*x state -- j*x )

state is:

0 for interpretation
-1 for compilation
-2 for POSTPONE

i*x is the additional information provided by the recognizer.
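
A minimal sketch of this convention, assuming a Forth 2012 system with the Exception word set; all `demo-` names are hypothetical and not part of the proposal. The recognizer itself never looks at STATE; the caller passes the state to the returned rectype:

```forth
\ Hypothetical names (demo-*), for illustration only.
: demo-rectype-null ( state -- ) drop -13 throw ;

: demo-rectype-num ( n state -- n | )
  case
    -1 of  postpone literal  endof   \ compilation: compile n as a literal
  endcase ;                          \ any other state: just leave n

: demo-rec-num ( addr u -- n xt-num | xt-null )
  0. 2swap >number 0= IF  2drop ['] demo-rectype-num
  ELSE  2drop drop ['] demo-rectype-null  THEN ;

\ Recognize a string and translate it in interpretation state (0).
: demo-interpret-num ( addr u -- n ) demo-rec-num  0 swap execute ;

s" 123" demo-interpret-num .   \ prints 123
```

Only interpretation and compilation are handled in this sketch; a POSTPONE branch (-2) would be added as a further `of` clause.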

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( addr len -- i*x rectype | RECTYPE-NULL ) RECOGNIZER

This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.

RECTYPE-NULL ( state -- ) RECOGNIZER

Performs -13 THROW if the exception wordset is available.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x rectype / rectype-null )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer state @ swap execute
  REPEAT  2drop ; \ drop the empty string left by parse-name

: lit,  ( n -- )  postpone literal ;
: rectype-null ( state -- ) -13 throw ;
: rectype-nt ( nt state -- )
  case
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: rectype-num ( n state -- )
  case
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt rectype-nt / rectype-null )
  forth-wordlist find-name-in dup IF  ['] rectype-nt  ELSE  drop ['] rectype-null  THEN ;
: rec-num ( addr u -- n rectype-num / rectype-null )
  0. 2swap >number 0= IF  2drop ['] rectype-num  ELSE  2drop drop ['] rectype-null  THEN ;

: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
  2>r \ keep the string on the return stack; it must be removed before EXIT
  2r@ rec-nt dup ['] rectype-null <> IF  2r> 2drop EXIT  THEN  drop
  2r@ rec-num dup ['] rectype-null <> IF  2r> 2drop EXIT  THEN  drop
  2r> 2drop ['] rectype-null ;

' minimal-recognizer is forth-recognizer

Testing

JennyBrien [r500] 2020-09-06 17:42:23

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

I don't think so. It doesn't make much difference in application, because you (almost always?) need to consume the rec-type immediately to use whatever else might be on the stack(s). If you already know what you've got but, for example, can't remember the words to POSTPONE it, you could with an active RECTYPE do something like:

    -2 RECTYPE-X

But mostly you'll have the RECTYPE sitting passively on the stack as a return for a recognizer, and I don't see a great deal of difference between:

    : postponed  -2 swap execute ;

and

    : postponed  @ execute ;

Passive rectypes are easier to use (no need to remember when to tick them) and easier to code (no need to check for a bogus mode on the stack).

Compare:

: rectype-nt ( nt state -- )
  case
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;

with:

 : rectype: create , , , ;
 :noname name>interpret execute ;
 :noname name>compile execute ;
 :noname name>compile swap lit, compile, ;  rectype: rectype-nt

BerndPaysan [r503] 2020-09-06 20:35:46

One possible thing is to have an automatic postpone for literals.

: rectype-lit: ( compile-xt "name" -- )
  create ,
  does> @ swap
  case
      0  of  drop  endof
      -1 of  execute  endof
      -2 of  dup >r execute r> compile,  endof
  endcase ;

' lit, rectype-lit: rectype-num
' 2lit, rectype-lit: rectype-dnum
' flit, rectype-lit: rectype-float
' slit, rectype-lit: rectype-string

This works with this method, but not with the previous way.

BerndPaysan [r504] 2020-09-06 20:38:26

Furthermore, obviously anyone sane who doesn't want to be 100% minimal would instantly define

: rectype: ( xt-int xt-comp xt-post "name" -- )
  create , , , does> swap 2 + cells + @ execute ;

and then define generic rectypes just like in Matthias Trute's version with rectype:
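
As a hedged illustration of how such a definer might be used (the `demo-` names are hypothetical; the definition of `rectype:` is repeated from above so the sketch stands alone), a number rectype can be built from three xts and then executed with an explicit state:

```forth
: rectype: ( xt-int xt-comp xt-post "name" -- )
  create , , , does> swap 2 + cells + @ execute ;

\ Hypothetical helper and rectype (demo-*), for illustration only.
: demo-lit, ( n -- ) postpone literal ;

:noname ( n -- n ) ;                              \ interpret: leave n
' demo-lit,                                       \ compile: compile n
:noname ( n -- ) demo-lit, postpone demo-lit, ;   \ postpone
rectype: demo-rectype-num

42 0 demo-rectype-num .     \ state 0: 42 stays on the stack and is printed

\ state -1 between [ and ]: 42 is compiled into demo-answer as a literal
: demo-answer ( -- n ) [ 42 -1 demo-rectype-num ] ;
```

Because `, , ,` stores the topmost xt first, the postpone action sits in cell 0 and the interpret action in cell 2, which is why the state (0, -1, -2) plus 2 indexes the table directly.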

JennyBrien [r505] 2020-09-07 10:20:28

: rectype-lit: ( xt -- )  ['] noop swap dup >r :noname r@ compile, r> postpone literal postpone compile, postpone ; rectype: ;

not so straightforward, but possible.

ruv [r507] 2020-09-07 15:10:04

Previous works

In general, I like the approach of active "rectype", i.e. when you can execute it to translate a token — so a "rectype" is a token translator: ( i*x token -- j*x ). I described this approach in comp.lang.forth in 2018 (news:pngvcc$pta$1@gioia.aioe.org).

Bernd may also remember the comparison of version D with the Resolvers API, where I specified this approach and even provided several POCs.

and then define generic rectypes just like in Matthias Trute's version with rectype:

I also showed, just for illustration, a hybrid variant, in which a "rectype" can be executed and can also be an argument of the accessors (and it is also compatible with version D, i.e. it is a "passive rectype" as JennyBrien mentioned above).

But the accessors from version D exclude some implementation approaches. Actually these accessors are useless when the higher-level methods are provided. Getting an xt and then executing it adds an unnecessary step without any benefit in most cases. Let's provide the corresponding methods instead of the accessors.

This works with this method, but not with the previous way.

I'm not sure what you refer to, but "automatic postpone for literals" can be implemented in version D too.

: create-rectype-for-literal ( xt-compiler "name" -- )
  ['] noop swap dup rectype:
;

Token translator

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves

RECTYPE-SOMETYPE ( i*x state -- j*x )

By convention, the name for such a word should start with an English verb.

Concerning passing the state: in my Resolvers API, the state is passed indirectly, i.e. not via the stack. This makes it easier to combine translators.

E.g.:

: tt-3lit ( 3*x -- 3*x | ) >r tt-2lit  r> tt-lit ;

: tt-3lit-s ( 3*x state -- 3*x | ) dup >r swap >r tt-2lit-s  r> r>  tt-lit-s ;

Passing the state is cumbersome. Also, take into account that it's usually already kept in a variable anyway. Why do you need to pass it via the stack again and again? What is the rationale for passing it directly?

Terminology

Please stop using confusing terminology such as "data type id" (in "The core principle is still that the recognizer is not aware of state, and the returned data type id is"). This terminology is not compatible with the language of the standard. I suggested the proper terminology before and have now published the proposal on forth-standard.org; let's use it (and improve it where needed), or let's accurately define another terminology. The fact is that all the proposals about recognizers can share the same terminology.

Another example is "recognizer types" term. If a recognizer is a Forth definition having particular behavior, then "recognizer type" is "type of a recognizer", that is a type of a Forth definition, something like function type. But actually you mean a "token descriptor", that is "descriptor of a token", that tells something about the corresponding token, and tells nothing about the recognizers (as Forth definitions).

ruv [r508] 2020-09-07 15:58:29

Advantages

A huge advantage of this approach (when the state is passed indirectly) is that most user-defined token translators can be created far more easily than the corresponding descriptors ("rectypes"). You don't need to cope with three actions, and you don't need to cope with the state at all, since any token translator can be built from other, already defined translators!

BerndPaysan [r510] 2020-09-07 17:12:08

Yes, I proposed that kind of solution years ago. In effect, both ways have the same expressive power, but one does it by creation of noname words, the other by normal code. Acceptance may differ.

ruv [r513] 2020-09-07 18:02:52

@JennyBrien wrote

Compare: [...] with:

 : rectype: create , , , ;
 :noname name>interpret execute ;
 :noname name>compile execute ;
 :noname name>compile swap lit, compile, ;  rectype: rectype-nt

(sic: the full postpone action).

This comparison is incorrect since in the proposed API rectype: (that generates a token translator) can be defined as the following:

: rectype: ( xt-executer xt-compiler xt-postponer "name" -- )
  >r >r >r : ]] case
    0  of  [[ r> xt, ]] endof
    -1 of  [[ r> xt, ]] endof
    -2 of  [[ r> xt, ]] endof
    -22 throw
  endcase [[ postpone ;
;

And you can use the same code of yours to define your rectype-nt or anything else.

BerndPaysan New Version: minimalistic core API for recognizers [r514] 2020-09-08 08:36:42

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] rectype-null  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- rectype )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translator | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

SOME-TRANSLATOR ( i*x -- j*x )

A translator depends on STATE to translate the given arguments:

0 for interpretation
-1 for compilation
-2 for POSTPONE

i*x is the additional information provided by the recognizer.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( addr len -- i*x translator | NOTFOUND-xt ) RECOGNIZER

This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW if the exception wordset is available.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
  REPEAT  2drop ; \ drop the empty string left by parse-name

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: nt-translator ( nt -- )
  case  state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: num-translator ( n -- )
  case  state @
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] nt-translator  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] num-translator  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

The different actions during interpret/compile/postpone can be factored out easily, and used by a common dispatcher:

: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

Testing

BerndPaysan New Version: minimalistic core API for recognizers [r515] 2020-09-08 08:39:23

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translator | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

SOME-TRANSLATOR ( i*x -- j*x )

A translator depends on STATE to translate the given arguments:

0 for interpretation
-1 for compilation
-2 for POSTPONE

i*x is the additional information provided by the recognizer.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( addr len -- i*x translator | NOTFOUND-xt ) RECOGNIZER

This is a deferred word. It takes a string and tries to recognize it, returning the translator and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW if the exception wordset is available.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
  REPEAT  2drop ; \ drop the empty string left by parse-name

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: nt-translator ( nt -- )
  case  state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: num-translator ( n -- )
  case  state @
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] nt-translator  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] num-translator  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

The different actions during interpret/compile/postpone can be factored out easily, and used by a common dispatcher:

: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
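
A hedged usage sketch for this dispatcher (the `demo-` names are hypothetical, and the sketch assumes that STATE holds exactly -1 while compiling, which the standard does not guarantee). The definition of `translator:` is repeated so the example stands alone:

```forth
: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

\ Hypothetical names (demo-*), for illustration only.
: demo-keep ( n -- n ) ;
: demo-lit, ( n -- ) postpone literal ;
: demo-post-num ( n -- ) demo-lit, postpone demo-lit, ;

' demo-keep ' demo-lit, ' demo-post-num translator: demo-translate-num

7 demo-translate-num .     \ interpreting: STATE is 0, so 7 stays on the stack

\ An immediate word runs while STATE is -1, so the translator compiles 7.
: demo-7, ( -- ) 7 demo-translate-num ; immediate
: demo-seven ( -- n ) demo-7, ;
```

The POSTPONE branch (STATE set to -2) is not exercised here, since it requires a text interpreter that sets that state.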

Testing

BerndPaysan [r516] 2020-09-08 11:33:30

Downside of using STATE right in the dispatcher: POSTPONE becomes more difficult. Instead of

: postpone ( "name" -- ) parse-name forth-recognizer -2 swap execute ; immediate

it is more convoluted

: postpone ( "name" -- )
  parse-name forth-recognizer
  state @ >r -2 state !  catch  r> state !  throw ; immediate

How to detect [[ at the end of a postpone sequence is also not so trivial.

ruv [r517] 2020-09-08 14:48:20

Downside of using STATE right in the dispatcher: POSTPONE becomes more difficult.

It's OK. Actually, we distribute complexity among various parts. When we make one thing less complex, we make another thing more complex. But because these things occur with different frequency (in systems, libraries, programs), the overall complexity can end up lower or higher.

This approach also makes some things more complex, but the overall complexity decreases, I believe.

Concerning POSTPONE. I think, some useful parts should be factored out.

Also, we don't need to catch the exception: usually it's a stop error, and the state is ambiguous in any case. QUIT resets all the internal states. As for programs, we need a standard way to reset the internal states of the Forth text interpreter, regardless of the Recognizers proposal.

In my "lexeme resolvers" implementation I use the concept of a postponing level that can be 0, 1, or 2, and introduce words to increment and decrement this level. So POSTPONE is defined as follows:

: postpone  ( " name" --      )   parse-name inc-state translate-lexeme dec-state ( flag ) ?nf ; immediate

Where translate-lexeme is defined as the following:

: perceive-lexeme ( c-addr u -- k*x xt-tt | c-addr u 0 )
  perceptor dup if execute then
;
: translate-lexeme ( i*x c-addr u -- j*x true | c-addr u 0 )
  perceive-lexeme dup if execute true then
;

(Note that, in contrast to this proposal, resolvers return ( c-addr u 0 ) on failure.)

How to detect [[ at the end of a postpone sequence is also not so trivial.

An appropriate approach is that the word ]] is a parsing word.

: ]] ( -- )
  inc-state begin
    next-lexeme 2dup s" [[" equals 0= while
    translate-lexeme ?nf
  repeat 2drop dec-state
; immediate

So we don't have any problem to detect [[ at the end.

An advantage of the postponing level concept is that the following code works as expected:

: foo [  ]] 123 . [[  ]  ;   foo \ prints 123

In the message news:rdcur5$ga4$1@dont-email.me (the full message: news:rdcn35$sd2$1@dont-email.me) I showed another approach, when postponing action is not required at all (i.e., -2 state in this proposal).

ruv [r518] 2020-09-08 16:45:16

translator: subtype of xt, and executes with the following stack effect:

SOME-TRANSLATOR ( i*x -- j*x )

It's correct in the general case, but it makes little sense, since any definition meets this stack effect.

So I think we should distinguish the parameters of a translator itself from the effect of translating the code that is passed to the translator. Possible variants:

\ We can define 'token' data type
TRANSLATE-SOMETOKEN ( i*x token -- j*x )

\ Some hybrid variant
TRANSLATE-SOMETOKEN  ( i*x token{k*x} -- j*x )

\ Only low level data types
TRANSLATE-SOMETOKEN  ( i*x k*x -- j*x )

(NB: I use the conventional naming {verb}-{noun} for such words.)

It should also be noted that these x may be distributed across all the stacks: the data stack, the floating-point stack, and the control-flow stack (except token k*x, which cannot be on the control-flow stack).

BerndPaysan [r519] 2020-09-08 20:14:11

Indeed, TRANSLATE-SOMETHING sounds better than SOMETHING-TRANSLATOR.

FORTH-RECOGNIZER is ok, because it's followed by EXECUTE, so this is a noun.

ruv [r520] 2020-09-09 08:13:21

"FORTH-RECOGNIZER" name

I thought about the FORTH-RECOGNIZER name. It gives a strong impression that this word is similar to FORTH-WORDLIST ( -- wid ). The problem is that it isn't.

FORTH-WORDLIST is a constant (it always returns the same value) that identifies one and the same word list among all the word lists. This word list can be included in the search order, and it can be absent from the search order.

By analogy, FORTH-RECOGNIZER should be a constant that identifies one and the same recognizer among all the recognizers. This recognizer can be included in the recognizer that is used by the Forth text interpreter, and it can be absent from it. (In accordance with the concept that a sequence of recognizers is also a recognizer.)

All of this would have to hold for the naming to be consistent. But actually it does not. This means that the name breaks consistency and is inappropriate for the proposed word.

FORTH-RECOGNIZER ( -- xt ) can be a word that returns xt of the system's recognizer that is used by the Forth text interpreter by default (i.e. initially).

FORTH-RECOGNIZER is ok, because it's followed by EXECUTE, so this is a noun.

Also, it gives a strong impression that it returns a recognizer. But that's wrong. Also, its result is analyzed much more often than it's followed by EXECUTE.

Basic methods

By no means, we need

a method that tells the Forth text interpreter to use a given recognizer.
a method that returns the recognizer that is currently used by the Forth text interpreter,
a method that performs the recognizer that is currently used by the Forth text interpreter

A single deferred word (a vector) X can solve it:

set: IS X
get: ACTION-OF X
perform: X

But I insist that this approach limits implementations too much. A Forth system may want to perform its own internal actions when the recognizer used by the Forth text interpreter is switched. But it cannot do so if this recognizer is switched via the IS X method. For that reason, separate getter and setter words are usually provided in the Standard (except for the very old BASE and >IN, kept for backward compatibility). Yes, perhaps Gforth can attach additional internal actions to the IS X phrase. But we shouldn't complicate all Forth system implementations.

A possible implementation via deferred word and distinct getter and setter words:

defer perceive ( c-addr u -- k*x tt )
: perceptor ( -- xt ) action-of perceive ;
: set-perceptor ( xt -- ) is perceive ;

Perhaps, the more specific names are better (?):

defer perceive-lexeme ( c-addr u -- k*x tt )
: lexeme-perceptor ( -- xt ) action-of perceive-lexeme ;
: set-lexeme-perceptor ( xt -- ) is perceive-lexeme ;
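
As a hedged illustration of an "internal action on switching", a dedicated setter could record the previously installed recognizer so that the switch can be undone. This is only a sketch under that assumption; every `demo-` name is hypothetical, and it uses a plain variable instead of a deferred word:

```forth
\ Hypothetical sketch; every name here is illustrative only.
variable demo-current    0 demo-current !     \ recognizer currently in use
variable demo-previous   0 demo-previous !    \ recognizer before the last switch

: demo-perceive ( c-addr u -- k*x tt ) demo-current @ execute ;
: demo-perceptor ( -- xt ) demo-current @ ;
: demo-set-perceptor ( xt -- )
  demo-current @ demo-previous !    \ internal action: remember the old xt
  demo-current ! ;
: demo-undo-perceptor ( -- )        \ restore the xt saved by the last switch
  demo-previous @ demo-current ! ;

\ Two dummy recognizers that ignore the string and return a marker.
: demo-rec-a ( c-addr u -- 1 ) 2drop 1 ;
: demo-rec-b ( c-addr u -- 2 ) 2drop 2 ;

' demo-rec-a demo-set-perceptor
' demo-rec-b demo-set-perceptor
s" x" demo-perceive .     \ demo-rec-b is active
demo-undo-perceptor
s" x" demo-perceive .     \ demo-rec-a is active again
```

Only one level of undo is kept here; a real system would more likely keep a stack of earlier recognizers.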

ruv [r521] 2020-09-09 08:25:37

Correction: please read "By anyway, we need" instead of "By no means, we need".

BerndPaysan [r524] 2020-09-10 21:36:12

DEFER is a core word now, so using DEFER for such a thing is ok. We don't need a special getter and setter for everything.

The implication that FORTH-RECOGNIZER returns a recognizer (and does not, it executes one) is a valid point. A better name is needed. At the moment it is a VALUE and does return a recognizer. Now, it is a deferred word, and does recognize strings. We should keep it with Anton's unification: a sequence of recognizers can be combined to one recognizer. Just because it's now recognizing more different things, it's still a recognizer. No need to find another synonym. Takes a string, returns data plus translator token: that is a recognizer.

Maybe RECOGNIZE-FORTH is the corresponding verb. It takes a string and recognizes it if this is valid FORTH.

ruv [r525] 2020-09-11 03:46:35

DEFER is a core word now, so using DEFER for such a thing is ok.

Actually, DEFER, as well as TO, is a Core extension word, so it's optional. But it's another argument.

Back to my first argument, what do you suggest if a system needs to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter?

You can ask, do I have an example of such requirement. Yes, I do. I want to provide a method to undo such switching in my system. It's similar to effect of the "PREVIOUS" word for the search order. Perhaps you can suggest some solution with the deferred word?

Anton's unification: a sequence of recognizers can be combined to one recognizer.

Yes. I too said that any sequence of recognizers seq-x (from API v4) can be represented as a single recognizer : recognize-x seq-x recognize ;. So sequences are superfluous in the basic API; a Forth system doesn't need to know whether it is a sequence or not.
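
A minimal sketch of this idea, without any sequence object: two recognizers are combined by an ordinary colon definition that has the same stack effect as its parts. All `demo-` names are hypothetical, and the sketch follows the NOTFOUND convention of the proposal text above:

```forth
\ Hypothetical names (demo-*), for illustration only.
: demo-notfound ( -- ) -13 throw ;
: demo-keep ( x -- x ) ;      \ trivial translator: leave the value as is

: demo-rec-yes ( c-addr u -- true xt-keep | xt-notfound )
  s" yes" compare 0= IF  true ['] demo-keep  ELSE  ['] demo-notfound  THEN ;
: demo-rec-no ( c-addr u -- false xt-keep | xt-notfound )
  s" no" compare 0= IF  false ['] demo-keep  ELSE  ['] demo-notfound  THEN ;

\ Try demo-rec-yes first, then demo-rec-no; same stack effect as either one.
: demo-rec-yes/no ( c-addr u -- flag xt-keep | xt-notfound )
  2>r  2r@ demo-rec-yes  dup ['] demo-notfound = IF
    drop  2r@ demo-rec-no
  THEN  2r> 2drop ;

s" no" demo-rec-yes/no execute .   \ prints 0
```

The caller cannot tell whether `demo-rec-yes/no` is a single recognizer or a combination, which is the point being made.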

Maybe RECOGNIZE-FORTH is the corresponding verb. It takes a string and recognizes it if this is valid FORTH.

It's better. But it recognizes not valid FORTH, but anything that the Forth text interpreter can currently recognize (and only that).

Conceptually, this word isn't just a recognizer. There is a single special system's slot for a recognizer that is used by the Forth text interpreter. We can put any recognizer into this slot. We can also perform the recognizer that is placed into this slot. So this word performs the recognizer from this slot. I am inclined to call this slot "perceptor". And after that the word that performs the recognizer from this slot becomes "perceive".

All recognizer names have the pattern RECOGNIZE-*. The idea is not to put this special word on a par with all other recognizers. For that, it's better to find a name that is distinct from the RECOGNIZE-SOMETHING pattern. What do you think?

ruv [r526] 2020-09-11 04:10:31

Actually, DEFER, as well as TO, is a Core extension word, so it's optional. But it's another argument.

This argument is that a Forth system can be implemented as a minimal kernel plus additional libraries, and DEFER, IS, ACTION-OF can be made available via a library. But when we put a deferred word into this API, we force a system's author to put DEFER, IS, ACTION-OF into the kernel too, although they aren't actually required in the kernel. It would be too restrictive a limitation on implementations.

ruv [r527] 2020-09-11 12:23:36

Locate

locate cannot work for lexemes that can be recognized (translated) according to this proposal.

ruv [r528] 2020-09-11 17:45:12

The last comment was intended for the proposal of AndrewHaley and was mistakenly placed here.

BerndPaysan [r529] 2020-09-11 21:55:54

The recognizer will be an option, as well. At the moment, FORTH-RECOGNIZER is proposed to be a value. That's also a CORE EXT word (as is TO).

A minimalistic system that wants to implement recognizers needs FORTH-RECOGNIZER to be a deferred word. I.e. it needs code for DODEFER. It can load the rest of the deferred word stuff later as extension.

ruv [r530] 2020-09-12 07:46:25

Certainly, recognizers are an option. I didn't mean that some required part requires an optional part. I mean that one optional part requires another complex optional part without any good and fair ground.

Yes, a minimalistic system that wants to provide a deferred word needs only code for DODEFER. But it still makes bootstrapping of this system more complex. Hence, when we put a deferred word into API, we make things more complex for some implementations. But we don't even have a rationale for that.

Also, with deferred word we still don't have a solution if a system needs to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter.

BerndPaysan [r539] 2020-09-14 21:33:55

CORE has only VARIABLE as option for storing things to change. As a result, the interface to use FORTH-RECOGNIZER has to be clumsy, i.e.

forth-recognizer @ execute execute

Clumsy interfaces can not be changed if you have better things at hand. You can probably wrap around the clumsy interface, e.g.

Defer recognize-forth
addr recognize-forth Constant forth-recognizer

if you can use ADDR to access the deferred word's xt storage location. But then you have another interface, less clumsy, and only available when you have DEFER+ADDR (and ADDR is not even part of the standard).

A minimalistic API, which is what I am looking for here, is one where you don't have to document much. The less uniform an API is, the more you have to document. The uniformity here is that a recognizer is a word that has ( addr u -- i*x translator-xt ) as stack effect. And combinations of recognizers have the same effect. And the system's recognizer is just another one, which you can swap in and out. And you can define a REC-SEQUENCE, where you can manipulate the sequence, and put that into the system's recognizer.

This uniformity is broken when you don't use a deferred word for the system's recognizer — you can't just call that one as you can call the others. You need @ EXECUTE. This is clumsy.

ruv [r540] 2020-09-15 10:04:01

CORE has only VARIABLE as option for storing things to change. As a result, the interface to use FORTH-RECOGNIZER has to be clumsy, i.e. forth-recognizer @ execute execute

I don't suggest using a variable in the interface; it's even worse than a defer. When a variable is used to change something, the change cannot be effectively detected. But the requirement is that a system be able to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter.

For that I would prefer to have the separate words in the API: a setter, a getter and a "performer" (a word that performs the recognizer that is currently used by the Forth text interpreter).

What are your objections to having several separate words in the minimalistic API?

The uniformity here is that a recognizer is a word that has ( addr u -- i*x translator-xt ) as stack effect.

I strongly support this approach (and I myself suggested this approach too, with slightly different stack effects).

This uniformity is broken when you don't use a deferred word for the system's recognizer

It seems, the set of words like the following (the names may vary):

perceive ( c-addr u -- k*x tt )
set-perceptor ( xt -- )
perceptor ( -- xt )

doesn't break the mentioned uniformity. Please clarify.

BerndPaysan [r541] 2020-09-16 14:26:03

Using special setters and getters means you have another (special purpose) DEFER mechanism here. Of course you can implement that with

variable current-perceptor
: perceive ( addr u -- i*x token ) current-perceptor @ execute ;
: set-perceptor ( xt -- ) current-perceptor ! ;
: perceptor ( -- xt ) current-perceptor @ ;

which is probably a bit less implementation effort than DEFER, IS, and ACTION-OF. Or really?

State-Smart:

: defer  Create ['] noop ,  does> @ execute ;
: is  ' >body state @ if  ]] literal ! [[  else  !  then ; immediate
: action-of  ' >body state @ if  ]] literal @ [[  else  @  then ; immediate

or with NDCS:

: defer  Create ['] noop ,  does> @ execute ;
: is  ' >body ! ; ndcs: ' >body ]] literal ! [[ ;
: action-of  ' >body @ ;  ndcs: ' >body ]] literal @ [[ ;

DEFER is really a lightweight way to define words that can be changed.

These three lines of code are doing more than the three lines of code you need in addition when you have your special-purpose setter and getter, but they are still one-liners.

Forthers like to reinvent the wheel. But don't overdo this.

ruv [r545] 2020-09-19 12:24:09

Using special setters and getters means you have another (special purpose) DEFER mechanism here.

Not necessarily. It's up to the author/implementer. It can be just wrappers over standard DEFER, as I showed earlier. So it doesn't mean reinventing the wheel. The implementation details are just hidden.

So the arguments concerning implementation of DEFER mechanism say nothing against three separate words in the minimalistic API.
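
For illustration, such wrappers over a standard DEFER might look like the following minimal sketch. The names (perceive), no-perceptor and perceptor-changes are hypothetical and not taken from any proposal text; the counter only stands in for whatever internal actions a system wants to perform in the setter.

```forth
\ Sketch only: (perceive), no-perceptor and perceptor-changes are
\ hypothetical names chosen for this illustration.
defer (perceive) ( c-addr u -- k*x tt )
: no-perceptor ( c-addr u -- ) 2drop -13 throw ;
' no-perceptor is (perceive)

variable perceptor-changes  0 perceptor-changes !

\ The three API words; the DEFER behind them stays hidden.
: perceive ( c-addr u -- k*x tt ) (perceive) ;
: perceptor ( -- xt ) ['] (perceive) defer@ ;
: set-perceptor ( xt -- )
  ['] (perceive) defer!
  1 perceptor-changes +! ;  \ stand-in for system-internal actions
```

Because programs only see perceive, perceptor and set-perceptor, a system remains free to replace the DEFER by any other mechanism.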

BTW, having translators for the basic data types, the words is and action-of can be even shorter:

: is  ' >body tt-lit ['] ! tt-xt ; immediate
: action-of  ' >body tt-lit ['] @ tt-xt ; immediate

Well, in any case I would agree that the arguments concerning complexity are more or less weak.

A strong argument (that hasn't been commented on yet) concerns additional actions that a system needs to perform in the setter. What do you think in this regard?

ruv [r565] 2020-10-29 00:48:42

One more strong argument against a DEFER word in the API, and in favor of a distinct getter and setter, is the following.

With DEFER in the API, we cannot define this API over another API at all. But with a distinct getter and setter (and "executer"), it is possible to define this API over some other APIs.

Example: news:rn1csa$b02$1@dont-email.me

BerndPaysan [r571] 2020-11-19 19:58:48

Gforth's new header structure allows overloading TO, IS (which are essentially the same) and DEFER@, so we can use the DEFER API to access similar changeable execution patterns that are implemented differently. So for us, it makes sense to use these access words, regardless of how it is implemented.

Other systems may not have this capability, though the way the standard now extends TO for FVALUE and others, you need to have one way or the other to deal with that. Same, when you have an UDEFER in your system for user-specific deferred words.

For me, it is needless clutter of the dictionary and the mental space of the programmer to add setters and getters for things where you already have a generic one. But I see the point that not every system can do this.

ruv [r572] 2020-11-20 14:25:42

needless clutter of the dictionary and the mental space of the programmer

I have used an approach where a defining word creates two words: a getter and a setter. For example, after the phrase create-prop x the words x and set-x are created. I didn't notice any mental space clutter in this regard. Sometimes I redefined set-x to add additional checks or actions.

Concerning dictionary space — I don't see any problem.

But I see the point that not every system can do this.

True. And even if a system can do this, it's done in some system specific way only.

So, due to the combination of all reasons, it's better to have distinct ordinary words in the standard API.

StephenPelc [r653] 2021-04-19 04:21:03

If people are interested, I can arrange a virtual meeting for recognisers. They have been workshopped at various Forth Standards meetings but little of substance has emerged so far. I would suggest that such a meeting concentrate on finding what we can agree on.

Note that Forth-200x meetings are public, and the use of real names is strongly encouraged.

ruv [r663] 2021-04-19 17:05:40

If people are interested, I can arrange a virtual meeting for recognisers. ... concentrate on finding what we can agree on.

I like this idea.

If people are interested, I will prepare before the meeting a proof of concept — an implementation of Recognizer API v4, Nestable Recognizer Sequences, or some other over this API.

Perhaps somebody could share their list of questions before the meeting. My list is on GitHub.

StefanK [r831] 2022-06-25 10:16:33

A small remark to the POSTPONE test.

We can factor postpone into two parts with state-execute, similar to base-execute:

   : state-execute ( xt s -- )  state @ >r state ! catch r> state ! throw ;
   : POSTPONE ( "name" -- ) parse-name forth-recognizer -2 state-execute ; immediate

That's not very difficult anymore.

StefanK [r832] 2022-07-04 21:02:03

IMHO the idea to use a deferred forth-recognize is good and more flexible than a stack of recognizers. But the translator xt makes postpone more difficult. But we can factor postpone into two parts. One that restores the stack contents at runtime similar to lit,, and one that does the compilation. If we use rectype, similar to the proposal of recognizers from 2018, but with lit, as third method, we get an easy postpone and '. Here, we can reuse the compile method directly by the second factor of postpone.

variable state

: translate>interpret @ ;
: translate>compile   cell+ @ ;
: translate>lit,      cell+ cell+ @ ;
\ Well, it's translate>*lit, in fact; i.e. regenerate ( i*x ) at runtime.

Defer forth-recognizer ( addr u -- i*x translator / notfound )
Defer perform ( i*x translator -- j*x )

: perform>interpret translate>interpret execute ;
: perform>compile   translate>compile   execute ;

: on  -1 swap ! ;
: off  0 swap ! ;

: [ ['] perform>interpret is perform state off ; IMMEDIATE
: ] ['] perform>compile   is perform state on ;

\ alternatively:
\ :noname is perform state ; dup
\ : [ ['] perform>interpret [ compile, ]  off ; IMMEDIATE
\ : ] ['] perform>compile   [ compile, ]  on ;
\ another alternative:
\ : perform state @ IF translate>compile ELSE translate>interpret THEN execute ;

' [ execute \ initialize state and perform
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer perform
  REPEAT ;

: lit,  ( n -- )  lit lit , ,  ; \ or postpone literal
: throw-13 -13 throw ;

: translator ( xt-*lit, xt-compile  xt-interpret "name" -- )
    create , , , ;

' throw-13 dup dup translator notfound

' lit,
:noname ( nt -- xt-execute | xt-compile, ) dup >cfa swap immediate? IF execute ELSE compile, THEN ;
:noname ( i*x nt -- j*x ) >cfa execute ;
translator translate-nt

' lit,
' lit,
:noname ; \ noop
translator translate-const-cell

: rec-nt ( addr u -- nt translate-nt | notfound )
  forth-wordlist find-name-in dup IF  translate-nt  ELSE  drop notfound  THEN ;
: rec-num ( addr u -- n translate-const-cell | notfound )
  0. 2swap >number 0= IF  2drop translate-const-cell  ELSE  2drop drop notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator | n num-translator | notfound )
  2>r 2r@ rec-nt dup notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

\ simple postpone
: postpone ( "name" -- )
    parse'n'recognize
    dup translate>compile >r translate>lit, execute r> compile,
; IMMEDIATE

\ postpone optimized for immediate words
: postpone ( "name" -- ) \ optimized for immediate words
    parse'n'recognize \ ( i*x translator )
    dup translate-nt = IF ( nt translator )
        over immediate? IF drop >cfa compile, exit THEN
    THEN
    dup translate>compile >r translate>lit, execute r> compile,
; IMMEDIATE

: ' ( "name" -- xt ) parse'n'recognize translate-nt <> IF throw-13 THEN >cfa ; IMMEDIATE

ruv [r834] 2022-07-14 22:32:24

@StefanK, thank you for your participation. But it looks like you have missed too many arguments discussed above.

For example, a deferred word forth-recognizer has a confusing name, and it cannot be acceptable in the API, since it's difficult for the Forth system to detect when its value is changed (NB: it isn't an argument in favor of "stack of recognizers").

: translator ( xt-*lit, xt-compile  xt-interpret "name" -- )
    create , , , ;

' lit,
:noname ( nt -- xt-execute | xt-compile, ) dup >cfa swap immediate? IF execute ELSE compile, THEN ;
:noname ( i*x nt -- j*x ) >cfa execute ;
translator translate-nt

Also, could you please stick to a consistent and clear terminology?

In this example you create not a token translator, but a named token descriptor (and the corresponding token descriptor object). See Common terminology for recognizers (improvements and criticism are welcome).

token descriptor object: an implementation dependent data object that describes how to interpret, how to compile and how to postpone (if any) a token.

I also proposed the following naming convention for the corresponding words:

For token translators use names in the form tt-*, which is the abbreviation of translate-token-* (for example, tt-lit, tt-nt).
For token descriptors use names in the form td-*, which is the abbreviation of token-descriptor-* (for example, td-lit, td-nt).

The approach used in your example to create a token descriptor can be called the "three components" approach. A significant disadvantage of this approach is that it doesn't provide a way to reuse old descriptors when you create a new descriptor. Compare this to token translators: they can easily be reused to create new token translators. For example, a token translator for a pair ( nt nt ) can be created using the token translator tt-nt for a single nt as:

: tt-2nt ( i*x nt nt -- j*x ) >r tt-nt r> tt-nt ;

To create a token descriptor td-2nt in the three components approach, you need to put in a lot more effort, and you cannot reuse td-nt descriptor.

One possible solution is to not expose the three components approach in the API and instead provide a special method to create a descriptor from other descriptors. For td-2nt it can look like:

  tt-nt dup 2 descriptor constant tt-2nt
  \ or
  td{ tt-nt tt-nt }td constant tt-2nt

It seems, a user never needs to provide three components for a new descriptor since any new descriptor is always based on some already defined descriptors.

But the approach based on the token translators is far simpler.

By the way, a well-known word to get an xt from an nt is name> ( nt -- xt ) (see Forth-83 / "C. Experimental proposal" / "Definition field address conversion operators").

BerndPaysan New Version: minimalistic core API for recognizers [r867] 2022-09-07 23:27:23

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Recognized

recognized: subtype of xt, and executes with the following stack effect:

RECOGNIZED-THING ( j*x i*x state -- k*x )

A recognized xt acts on the state passed to it on the stack

0 for interpretation
-1 for compilation
-2 for POSTPONE

i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x recognized-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the recognized xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

GET-FORTH-RECOGNIZE ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE.

REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

RECOGNIZED: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a recognized word under the name "name", which performs xt-int for state=0, xt-comp for state=-1 and xt-post for state=-2.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: nt-translator ( nt -- )
  case  state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: num-translator ( n -- )
  case  state @
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] nt-translator  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] num-translator  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

Stacks TBD.
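
Until the stack words are filled in, the following is a minimal sketch of how REC-SEQUENCE: could work, under stated assumptions: the xts are stored in a fixed table behind a count cell, SET-REC-SEQUENCE and GET-REC-SEQUENCE are not covered, and rec-try is a hypothetical helper name. The toy recognizers rec-x and rec-y and the translator translate-demo exist only to make the example self-contained.

```forth
: notfound ( -- ) -13 throw ;

\ Try n recognizers stored at address a, in memory order.
: rec-try ( c-addr u a n -- i*x xt )
  begin  dup  while
    over @ -rot 2>r              \ ( c-addr u xt ) ( R: a n )
    -rot 2dup 2>r rot execute    \ ( i*x xt' )     ( R: a n c-addr u )
    dup ['] notfound <> if
      2r> 2drop 2r> 2drop exit   \ success: discard the saved copies
    then
    drop 2r> 2r>                 \ ( c-addr u a n )
    1- swap cell+ swap           \ advance to the next table entry
  repeat
  2drop 2drop ['] notfound ;

\ Table layout: count, xtn, ..., xt1 (so xtn is tried first).
: rec-sequence: ( xt1 .. xtn n "name" -- )
  create dup , 0 ?do , loop
  does> ( c-addr u addr -- i*x xt )
    dup cell+ swap @ rec-try ;

\ Usage with two toy recognizers (hypothetical, illustration only):
: translate-demo ( n -- n ) ;
: rec-x ( c-addr u -- 10 xt | notfound-xt )
  s" x" compare 0= if  10 ['] translate-demo  else  ['] notfound  then ;
: rec-y ( c-addr u -- 20 xt | notfound-xt )
  s" y" compare 0= if  20 ['] translate-demo  else  ['] notfound  then ;
' rec-x ' rec-y 2 rec-sequence: rec-xy   \ tries rec-y first, then rec-x
```

The lexeme string is kept on the return stack while each recognizer runs, because the depth of the i*x returned by a successful recognizer is not known to the sequence.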

Testing

TBD

ruv [r868] 2022-09-08 07:24:58

A recognized xt acts on the state passed to it on the stack

A proper term for "recognized xt" ("recognized execution token") should be chosen. "recognized xt" means "xt that is recognized", but we don't recognize execution tokens; we recognize lexemes. This xt is just a result of recognizing a lexeme. And it should be named according to what it does, not according to who produces it.
There is no reason to pass state on the stack. We discussed that, and the reference implementation reflects that.

BerndPaysan [r871] 2022-09-08 14:53:40

The STATE discussion in the 2021 workshop concluded that words or xt executed should not depend on STATE. The reference implementation needs to be adjusted.

For the name of the result values we might want to have another round of bikeshedding. In particular with more native speakers. The current wording represents the last round of bikeshedding.

BerndPaysan New Version: minimalistic core API for recognizers [r872] 2022-09-08 14:57:40

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: word-translator ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  word-translator
  ELSE  drop ['] notfound  THEN ;

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Recognized

recognized: subtype of xt, and executes with the following stack effect:

RECOGNIZED-THING ( j*x i*x state -- k*x )

A recognized xt acts on the state passed to it on the stack

0 for interpretation
-1 for compilation
-2 for POSTPONE

i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x recognized-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the recognized xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

GET-FORTH-RECOGNIZE ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE.

REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

RECOGNIZED: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a recognized word under the name "name", which performs xt-int for state=0, xt-comp for state=-1 and xt-post for state=-2.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer state @ swap execute \ the recognized xt expects the state on the stack
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: recognized-nt ( nt state -- )
  case
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: recognized-num ( n state -- )
  case
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt recognized-nt / notfound )
  forth-wordlist find-name-in dup IF  ['] recognized-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n recognized-num / notfound )
  0. 2swap >number 0= IF  2drop ['] recognized-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt recognized-nt / n recognized-num / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

Stacks TBD.

Testing

TBD

ruv [r874] 2022-09-08 21:18:31

The STATE discussion in the 2021 workshop concluded that words or xt executed should not depend on STATE.

I see the following in the report by @ulli on 2021-09-08:

Given the two variations to handle STATE (either in RECOGNIZER:'s DOES> part or in INTERPRET), yesterday's participants favoured to have the single occurrence of STATE in INTERPRET. Further investigation and model implementations will show whether one or the other is beneficial.

So it implies further investigation and model implementations.

Could someone provide a rationale in favor of passing state (better to say "mode") via the stack?

My rationale against mode on the stack is as follows:

It makes combination of token translators cumbersome. E.g. a definition : tt-3lit ( 3*x -- 3*x | ) >r tt-2lit r> tt-lit ; becomes far more complex.
In most cases a program doesn't need to execute a token translator in a mode that is different from the current mode (counter examples are welcome, except postpone).
The current mode is already held by the system anyway.
(most importantly) It introduces unnecessary coupling between the Forth text interpreter loop and the Recognizer API. This loop does not need to know anything about modes and STATE at all. If we are replacing the system's lexeme translator (along with the system's set of token translators), we should be able to replace it along with the system's STATE (and the set of the system's modes) too. Moreover, a token translator can technically ignore the passed value and use its own set of modes. And even such a simpler mode-beyond-stack API can be implemented over one that passes mode via the stack.
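
The first point above can be sketched as follows. This is an illustration under assumptions, not code from the proposal: tt-lit is assumed to consult STATE (used here only for brevity), and the /m variants are hypothetical names for translators that take the mode as a flag on the stack.

```forth
\ Mode held by the system (STATE is used here only for the sketch).
: tt-lit  ( x -- x | )  state @ if  postpone literal  then ;
: tt-2lit ( x1 x2 -- x1 x2 | )  >r tt-lit r> tt-lit ;
: tt-3lit ( x1 x2 x3 -- x1 x2 x3 | )  >r tt-2lit r> tt-lit ;

\ Mode passed on the stack: every combination has to duplicate
\ the mode and thread it through to each component translator.
: tt-lit/m  ( x mode -- x | )  if  postpone literal  then ;
: tt-2lit/m ( x1 x2 mode -- x1 x2 | )
  swap over 2>r tt-lit/m 2r> tt-lit/m ;
: tt-3lit/m ( x1 x2 x3 mode -- x1 x2 x3 | )
  swap over 2>r tt-2lit/m 2r> tt-lit/m ;
```

In interpretation, both 1 2 3 tt-3lit and 1 2 3 0 tt-3lit/m leave the three values on the stack; the difference lies only in the bookkeeping each combined translator has to do.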

On the other hand, I don't think that including (mentioning) STATE in a new API is a good choice. STATE returns a read-only address, and it's provided for backward compatibility only. So a better method instead of STATE is required anyway.

Actually, the system's token translators are the only ones that depend on the system's set of modes. In most cases user-defined token translators are defined via the system's token translators (which should be standardized), and they need to know nothing about the system's set of modes or about STATE at all. At the same time, a user is able to define their own set of recognizers and token translators that don't depend on the system's set of modes, but introduce their own set of modes.

So, the specification for the Recognizer API should mention neither STATE nor a set of magic values like {0, -1, -2}.

Concerning your mode -2 — I believe, the standard word postpone doesn't need an own mode. But in postponing mode, if any, string literals (like s" foo bar") and comments should be properly treated.

ruv [r875] 2022-09-08 23:54:34

It introduces unnecessary coupling between the Forth text interpreter loop and the Recognizer API.

See a block-based illustration of this idea in my Gist.

ruv [r876] 2022-09-09 11:47:13

: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

In the reference implementation you keep the mode {0,-1,-2} in the state variable. But it's problematic, regardless how the mode is passed into token translators (directly via the stack, or indirectly using a dedicated method).

The reason is this: when interpretation state is set by [ (so state is set to 0), and compilation state is then set by ], the mode should be the same as before [. If it was -2, it should be set to -2 again. But the information that the mode was -2 is lost. So another variable should be used to keep a flag whether "POSTPONE" mode is active or not.

Actually, the mode of compilation/interpretation and "POSTPONE" mode are not mutually exclusive. They can be set independently of each other.

For example, the code:

: foo postpone bar [ postpone baz ] ;

can conceptually be defined quite clearly (see my comment). In this fragment, for the lexeme bar "POSTPONE" mode is active in compilation state, for baz "POSTPONE" mode is active in interpretation state.

So, if "POSTPONE" mode is employed, a different variable for it should be used for this reason too.
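
A small model makes the loss visible. MODE, POSTPONING, M[ and M] below are hypothetical stand-ins for the system's STATE, a separate flag, [ and ]; nothing here touches the real STATE.

```forth
\ Model only: these names are invented for this illustration.
variable mode          \ one cell holding 0, -1 or -2
: m[ ( -- )  0 mode ! ;
: m] ( -- ) -1 mode ! ;

-2 mode !  m[ m]       \ enter "POSTPONE" mode, then do [ ... ]
mode @ .               \ prints -1: the -2 has been overwritten

variable postponing    \ separate flag, independent of MODE
true postponing !  m[ m]
postponing @ .         \ prints -1 (true): the flag survives [ ... ]
```

With the separate flag, the compilation/interpretation mode and the "POSTPONE" mode can be switched independently, as the bar/baz example requires.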

On the other hand I'm not convinced that we need "POSTPONE" mode at all. Except to implement the word postpone itself, where and how this mode can be used? Even for the questionable construct ]] ... [[ the mode "POSTPONE" isn't needed.

BerndPaysan [r877] 2022-09-09 11:59:51

OK, the most convincing argument is that STATE can go away as specified thing. You can use and combine system translators, and you can create table-driven translators, but STATE is an implementation detail.

BerndPaysan New Version: minimalistic core API for recognizers [r879] 2022-09-09 21:32:21

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

THING-TRANSLATOR ( j*x i*x -- k*x )

A translator xt performs or compiles the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused.

REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state.

TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state.

TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component you can use to construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component you can use to construct other translators.

Reference implementation:

This is a minimal core implementation of a recognizer-enabled system that handles only words and single-cell numbers without a base prefix. It takes only the interpret and compile states into account and uses the STATE variable to distinguish them.

Defer forth-recognizer ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognizer

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Stacks TBD, copy from Trute proposal.

Testing

TBD

BerndPaysan New Version: minimalistic core API for recognizers [r880] 2022-09-10 09:59:58

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor out the state-dependent part (everything from 0< state @ and up to THEN) and return it as the translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-word ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  2drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.
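As a sketch of such an addition, the loop below checks for stack underflow before each token. The name ?underflow is hypothetical, forth-recognize is declared locally only so that the fragment loads on its own, and the check assumes that DEPTH becomes negative after an underflow, which many systems do but the standard does not guarantee.

```forth
\ Stand-in declaration; a recognizer-enabled system provides this word.
defer forth-recognize ( addr u -- i*x translator-xt | notfound-xt )

\ Hypothetical check: relies on DEPTH going negative after an underflow.
: ?underflow ( -- )  depth 0< IF  -4 throw  THEN ;  \ -4: stack underflow

: interpret ( i*x -- j*x )
  BEGIN  ?underflow parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;
```

The check runs before PARSE-NAME so that an underflow caused by the previous token is reported before the next one is translated.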

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

THING-TRANSLATOR ( j*x i*x -- k*x )

A translator is an xt that performs or compiles the action of the thing, depending on the state the system is in.

i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change its behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state.

TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state.

TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component you can use to construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component you can use to construct other translators.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: get-forth-recognize ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , ( empty stack ) CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize   ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  dup stack: dup 1+ cells negate here + set-stack
  DOES>  recognize ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) >body get-stack ;

Testing

TBD

BerndPaysan New Version: minimalistic core API for recognizers [r881] 2022-09-10 15:03:52

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor out the state-dependent part (everything from 0< state @ and up to THEN) and return it as the translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  2drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator is an xt that performs or compiles the action of the thing, depending on the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain its current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state.

TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state.

TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component you can use to construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component you can use to construct other translators.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognizer is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Defining translators

Once you have TRANSLATE:, and the associated invocation tools, you shall define the translators using it:

: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt
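The same pattern extends to other data types. As a sketch, a translator for double-cell numbers could be defined as follows. The names 2lit, and translate-2lit are hypothetical, and translate: is repeated from the reference implementation only so that the fragment loads on its own; like that implementation, it assumes STATE holds 0, -1 or (system-specific) -2.

```forth
\ Stand-in copied from the reference implementation. Assumes STATE is
\ 0 (interpreting), -1 (compiling) or -2 (postponing, system-specific).
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

: 2lit, ( d -- )  postpone 2literal ;

:noname ;                        \ interpreting: leave d on the stack
' 2lit,                          \ compiling: compile d as a literal
:noname 2lit, postpone 2lit, ;   \ postponing: compile code that compiles d
translate: translate-2lit
```

A recognizer for double-cell numbers would then return the double followed by the xt of translate-2lit.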

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
' root ' forth ' forth 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;
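The definitions above rely on every wid being executable. A minimal sketch of how a classic wordlist id could be wrapped into such an executable wid is shown below. The defining word recognizer-wid: and the name rec-forth are hypothetical, notfound and translate-nt are local stand-ins, and the sketch assumes the system's own FIND-NAME-IN ( addr u wid -- nt | 0 ) is still visible at this point.

```forth
\ Local stand-ins for words the proposal provides.
: notfound ( -- )  -13 throw ;
: translate-nt ( j*x nt -- k*x )
  state @ IF  name>compile  ELSE  name>interpret  THEN  execute ;

\ Hypothetical defining word: wraps a wordlist id in a word that
\ behaves like a recognizer when executed.
: recognizer-wid: ( wid "name" -- )
  create ,
  does> ( addr u body -- nt xt-translate-nt | xt-notfound )
    @ find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;

forth-wordlist recognizer-wid: rec-forth
```

With this in place, ' rec-forth can be stored in a recognizer sequence like any other recognizer xt.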

Testing

TBD

BerndPaysan New Version: minimalistic core API for recognizers [r882] 2022-09-10 15:38:36

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation
2022-09-10 Typo fixed

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor out the state-dependent part (everything from 0< state @ and up to THEN) and return it as the translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  2drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator is an xt that performs or compiles the action of the thing, depending on the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state.

TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state.

TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state.
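These three words select a translator's action explicitly instead of leaving the choice to the current mode. The sketch below uses TRANSLATE-COMP to compile a number while the system is interpreting inside [ ... ]. The helper compile-num is hypothetical, and translate:, translate-int and translate-comp are repeated from the reference implementation (including its assumption that STATE is 0, -1 or -2) only so that the fragment loads on its own.

```forth
\ Stand-ins copied from the reference implementation.
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int  ( j*x i*x translator-xt -- k*x )  >body 2 cells + @ execute ;
: translate-comp ( j*x i*x translator-xt -- k*x )  >body cell+ @ execute ;

: lit, ( n -- )  postpone literal ;
:noname ;  ' lit,  :noname lit, postpone lit, ;  translate: translate-num

\ Hypothetical helper: compile n as a literal regardless of STATE.
: compile-num ( n -- )  ['] translate-num translate-comp ;

\ STATE is 0 between [ and ], yet 7 still ends up compiled into SEVEN.
: seven ( -- n )  [ 7 compile-num ] ;
```

TRANSLATE-INT works the same way in the other direction: 9 ' translate-num translate-int leaves 9 on the stack even when called from code that runs during compilation.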

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component you can use to construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component you can use to construct other translators.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Defining translators

Once you have TRANSLATE:, and the associated invocation tools, you shall define the translators using it:

: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
' root ' forth ' forth 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysan New Version: minimalistic core API for recognizers [r883] 2022-09-12 14:45:06

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation
2022-09-10 Typo fixed
2022-09-12 Fix for search order reference implementation

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor out the state-dependent part (everything from 0< state @ and up to THEN) and return it as the translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  2drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator is an xt that performs or compiles the action of the thing, depending on the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries the recognizers in order, starting with xtn and proceeding towards xt1, until one of them succeeds.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.

TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in interpretation state.

TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in compilation state.

TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT

Translate as in postpone state.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component you can use to construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component you can use to construct other translators.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
  case state @
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: translate-num ( n -- )
  case state @
      -1 of   lit,  endof
  endcase ;
: translate-dnum ( d -- )
  \ example of a composite translator using existing translators
  >r translate-num r> translate-num ;

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- )  >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- )  >body cell+ @ execute ;
: translate-post ( translate-xt -- )  >body @ execute ;

Defining translators

Once you have TRANSLATE:, and the associated invocation tools, you shall define the translators using it:

: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt
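
As a further illustration of TRANSLATE:, here is a minimal sketch of a translator for double-cell numbers built along the same lines. The names TRANSLATE-DOUBLE, 2LIT,, [5.] and FIVE. are hypothetical; TRANSLATE: is repeated so the block loads on its own, and it assumes that STATE holds 0 or -1 (with -2 as a system-specific postpone state).

```forth
\ Sketch only, not part of the proposal text.
: noop ( -- ) ;                        \ not a standard word, so define it here
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,                         \ body layout: xt-post, xt-comp, xt-int
  does> ( i*x body -- j*x )  state @ 2 + cells + @ execute ;

: 2lit, ( d -- )  postpone 2literal ;  \ compile d as a double literal
' noop                                 \ interpret: leave d on the stack
' 2lit,                                \ compile: compile d as a literal
:noname ( d -- )  2lit, postpone 2lit, ;  \ postpone: defer the compilation
translate: translate-double

\ hypothetical helper that exercises the compile action inside a definition
: [5.] ( -- )  5. translate-double ; immediate
: five. ( -- d )  [5.] ;
```

Executing TRANSLATE-DOUBLE while interpreting leaves the double untouched, while FIVE. shows the compile action: the double is compiled as a literal and pushed when FIVE. runs.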

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

AntonErtl [r884] 2022-09-13 17:45:08

It seems to me that, given the reference implementation

' translate-nt translate-int
' translate-num translate-int
' translate-dnum translate-int

does not work (nor with translate-comp nor translate-post). Assuming you solve this, do you really want me to define, e.g.,

:noname ['] translate-nt translate-int ;

to get an xt equivalent to one of the xts that has been passed to translate:?

How do you implement POSTPONE (IIRC Matthias Trute has a reference implementation for that)?

What problem is solved by making all the translators state-smart? The problem I see is that you can only access the individual actions by saving state, setting state, executing the translator, and restoring the state. That's not a good design.
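
The workaround described in the previous paragraph can be sketched as follows. AS-COMPILING and ANSWER are hypothetical names, TRANSLATE-NUM is a simplified stand-in without a postpone case, and the sketch assumes that -1 denotes compilation state. Storing into STATE directly is something a standard program is not allowed to do, which is part of the objection.

```forth
\ Sketch only: forces compilation state around one translator call.
: translate-num ( n -- n | )  state @ IF  postpone literal  THEN ;

: as-compiling ( i*x translator-xt -- j*x )
  state @ >r  -1 state !     \ save STATE and force compilation state
  catch                      \ run the translator, trapping any THROW
  r> state !                 \ restore the saved STATE in every case
  throw ;                    \ re-raise whatever the translator threw

\ Between [ and ] the system interprets, yet 42 is compiled as a literal.
: answer ( -- n )  [ 42 ' translate-num as-compiling ] ;
```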

The specification of translate: mentions a "current mode". Where do I find out what a "mode" is? This is non-standard terminology.

BerndPaysan New Version: minimalistic core API for recognizers [r890] 2022-09-15 14:56:44

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation
2022-09-10 Typo fixed
2022-09-12 Fix for search order reference implementation
2022-09-15 Revert to Trute's table approach to call specific modes deliberately

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor out the part starting with state @ and return it as a translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.
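
As an illustration of this split between recognizer and translator, here is a minimal sketch of a recognizer for $-prefixed hexadecimal numbers. REC-HEX is a hypothetical name, and NOTFOUND and TRANSLATE-NUM are defined locally (without a postpone case) so that the block loads on a plain Forth-2012 system; none of this is the reference implementation.

```forth
\ Sketch only: the recognizer never looks at STATE, the translator does.
: notfound ( -- )  -13 throw ;
: translate-num ( n -- n | )  \ interpret: leave n; compile: compile a literal
  state @ IF  postpone literal  THEN ;

: rec-hex ( addr u -- n translator-xt | notfound-xt )
  dup 2 < IF  2drop ['] notfound EXIT  THEN          \ need "$" plus a digit
  over c@ [char] $ <> IF  2drop ['] notfound EXIT  THEN
  1 /string                                           \ skip the "$"
  base @ >r hex  0. 2swap >number  r> base !          \ convert in base 16
  nip IF  2drop ['] notfound EXIT  THEN               \ unconverted chars left
  drop ['] translate-num ;                            \ keep the low cell as n
```

Interpreting s" $FF" rec-hex execute leaves 255 on the stack; a string without the prefix yields the NOTFOUND xt, so the next recognizer in line can be tried.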

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries the recognizers in order, starting with xtn and proceeding towards xt1, until one of them succeeds.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT

Translate as in interpretation state

COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT

Translate as in compilation state

POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT

Translate as in postpone state

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component you can use to construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component you can use to construct other translators.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap literal compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysan New Version: minimalistic core API for recognizers [r891] 2022-09-15 15:02:45

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation
2022-09-10 Typo fixed
2022-09-12 Fix for search order reference implementation
2022-09-15 Revert to Trute's table approach to call specific modes deliberately

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor out the part starting with state @ and return it as a translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries the recognizers in order, starting with xtn and proceeding towards xt1, until one of them succeeds.
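
The lookup order can be illustrated with a minimal, self-contained sketch. REC-SEQ: is a hypothetical fixed-size stand-in for RECOGNIZER-SEQUENCE: (it does not use the stack library and cannot be changed after creation); REC-YES, REC-NO and REC-BOOL are hypothetical as well, and NOTFOUND and TRANSLATE-NUM are simplified local definitions.

```forth
\ Sketch only: an executable sequence that tries xtn first, xt1 last.
: notfound ( -- )  -13 throw ;
: translate-num ( n -- n | )  state @ IF  postpone literal  THEN ;

: rec-seq: ( xt1 .. xtn n "name" -- )
  create  dup ,  0 ?DO  ,  LOOP             \ body: n, xtn, .. , xt1
  does> ( addr u body -- i*x translator-xt | notfound-xt )
    dup cell+ swap @ cells over + swap ?DO  \ walk the stored xts, xtn first
      I @ rot rot 2dup 2>r rot execute      \ try one recognizer, keep addr u
      dup ['] notfound <> IF  2r> 2drop UNLOOP EXIT  THEN
      drop 2r>                              \ no match: restore addr u
    1 cells +LOOP
    2drop ['] notfound ;

: rec-yes ( addr u -- n translator-xt | notfound-xt )
  s" yes" compare 0= IF  -1 ['] translate-num  ELSE  ['] notfound  THEN ;
: rec-no ( addr u -- n translator-xt | notfound-xt )
  s" no" compare 0= IF  0 ['] translate-num  ELSE  ['] notfound  THEN ;

' rec-no ' rec-yes 2 rec-seq: rec-bool      \ tries rec-yes first, then rec-no
```

Because the sequence has the same stack effect as a single recognizer, REC-BOOL could itself be an element of another sequence.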

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT

Get the interpreter xt from the translator

COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT

Get the compiler xt from the translator

POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT

Get the postpone xt from the translator

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component you can use to construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component you can use to construct other translators.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap literal compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysan New Version: minimalistic core API for recognizers [r892] 2022-09-15 15:10:01

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation
2022-09-10 Typo fixed
2022-09-12 Fix for search order reference implementation
2022-09-15 Revert to Trute's table approach to call specific modes deliberately

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor out the part starting with state @ and return it as a translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries the recognizers in order, starting with xtn and proceeding towards xt1, until one of them succeeds.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT

Get the interpreter xt from the translator

COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT

Get the compiler xt from the translator

POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT

Get the postpone xt from the translator
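
How these three accessors let a caller pick one action regardless of STATE can be sketched as follows. The definitions are repeated from the table-based reference implementation so that the block is self-contained; they assume that translators are created by TRANSLATE: with this body layout and that STATE holds 0 or -1. SEVEN is a hypothetical test word.

```forth
\ Sketch only, following the table layout of the reference implementation.
: noop ( -- ) ;                          \ not a standard word, so define it here
: lit, ( n -- )  postpone literal ;
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,                           \ body layout: xt-post, xt-comp, xt-int
  does> ( i*x body -- j*x )  state @ 2 + cells + @ execute ;
: interpret-translator ( translator-xt -- xt-int )   >body 2 cells + @ ;
: compile-translator   ( translator-xt -- xt-comp )  >body cell+ @ ;
: postpone-translator  ( translator-xt -- xt-post )  >body @ ;

' noop  ' lit,  :noname ( n -- ) lit, postpone lit, ;  translate: translate-num

\ Pick the compile action explicitly while STATE is 0 (between [ and ]).
: seven ( -- n )  [ 7 ' translate-num compile-translator execute ] ;
```

No STATE manipulation is needed here: the compile action is fetched from the table and executed directly, and SEVEN pushes the compiled literal when it runs.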

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component you can use to construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component you can use to construct other translators.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysan New Version: minimalistic core API for recognizers [r1034] 2023-08-08 01:11:13

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation
2022-09-10 Typo fixed
2022-09-12 Fix for search order reference implementation
2022-09-15 Revert to Trute's table approach to call specific modes deliberately
2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt interprets, compiles or postpones the action of the thing, according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.
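
As an illustration of this stack effect, here is a minimal sketch of a user-defined translator for double-cell numbers. It assumes the TRANSLATE: definition from the reference implementation (repeated so that the fragment loads on its own, and relying on that implementation's use of STATE). 2LIT, and TRANSLATE-DNUM are hypothetical names, not part of the proposal. In this case i*x is the double d that a recognizer would deliver.

```forth
\ TRANSLATE: as in the reference implementation of this proposal
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

\ Hypothetical helper: compile a double-cell literal
: 2lit, ( d -- ) postpone 2literal ;

:noname ( d -- d ) ;                      \ interpret: leave d on the stack
' 2lit,                                   \ compile: compile d as a literal
:noname ( d -- ) 2lit, postpone 2lit, ;   \ postpone: compile code that compiles d
translate: translate-dnum ( j*x d -- k*x )

\ Usage sketch: the translator picks its action from the current state.
7. translate-dnum d.                      \ interpreting: d is left alone, d. prints 7
: five, ( -- ) 5. translate-dnum ; immediate
: t ( -- d ) five, ;                      \ compiling: 5. is compiled into t
t d.                                      \ prints 5
```

The postpone action is not exercised here, because reaching it requires a system-specific way to enter postpone state.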

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state, using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token; a system component from which you can construct other translators.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number; a system component from which you can construct other translators.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

BerndPaysan [r1035] 2023-09-11 20:47:25

I removed the access words to the xts for a reason: We don't do it that way in Gforth, and we actually found little use of those words.

There are (at least) three ways to create an interface to these operations:

Field-like, i.e. you can read and write the xts for interpretation/compilation/postpone. The typical usage is ( translator ) TRANSLATE-<state> @ EXECUTE.
Valuefield-like, as in the original Trute proposal. You can read the xts for interpretation/compilation/postpone without an extra @, but you can't write them, unless your access word also implements TO. The typical usage is ( translator ) TRANSLATE-<state> EXECUTE.
Deferfield-like, which is what Gforth does. Here, not only the @ is part of the operation, but also the EXECUTE (really a tail-call variant of it). You can neither read nor write the xts, unless the access words also implement IS and ACTION-OF. The typical usage is ( translator ) TRANSLATE-<state>, and that looks about right. Gforth uses different names to not collide with the proposal here.
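
To make the three styles concrete, here is a rough sketch for the postpone slot only. It assumes that a translator is the address of a three-cell table (interpret, compile, postpone xt), which is just one possible layout. POSTPONE-FIELD, POSTPONE-VALUE, POSTPONE-DEFER and NUM-TABLE are made-up names, chosen so that they collide neither with the proposal nor with Gforth.

```forth
: lit, ( n -- ) postpone literal ;

\ Sample table for numbers: cell 0 interpret, cell 1 compile, cell 2 postpone.
\ The xts are created first so that nothing is laid down inside the table.
:noname ( n -- n ) ;
:noname ( n -- ) lit, postpone lit, ;     ( xt-int xt-post )
create num-table  swap ,  ' lit, ,  ,

\ Field-like: returns the address of the slot, which can be read and written.
: postpone-field ( translator -- addr )  2 cells + ;

\ Valuefield-like: returns the xt stored in the slot.
: postpone-value ( translator -- xt )  2 cells + @ ;

\ Deferfield-like: performs the xt stored in the slot.
: postpone-defer ( i*x translator -- j*x )  2 cells + @ execute ;
```

With these definitions, ( n ) num-table postpone-field @ execute, ( n ) num-table postpone-value execute and ( n ) num-table postpone-defer all perform the same postpone action; they differ only in how much of the access the word itself does.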

As an extension, Gforth allows adding more states and thus more access words. That extension also adds IS, TO (which are synonyms) and ACTION-OF to the existing access word (there is only one, because only the postpone state needs explicit access), and it also implements the other two access words, for interpret and compile, which are never used on their own. Of course, when you add a new state, you need to specify what existing translators do in that state, so IS becomes necessary; ACTION-OF comes for free through Gforth's way of implementing TO and its variants, of which ACTION-OF is one. This extension is non-standard and not proposed here; it is used for creating obscured (“tokenized”) source code and for reading name=value-style config files.

The experience so far is that outside of this extension, there's only one of those three access words needed at all, which is TRANSLATE-POSTPONE, and it is exclusively needed inside the standard word POSTPONE itself, a word where the implementation is left up to the system anyways. So the usage of these words is extremely limited. Therefore, I deleted them and suggest not to standardize these words, following the “don't speculate” rule and the topic of this proposal to make a minimalistic API, which contains only what's necessary. These words are of little use, and therefore there's no need to standardize them.

BerndPaysan New Version: minimalistic core API for recognizers [r1036] 2023-09-11 20:58:32

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation
2022-09-10 Typo fixed
2022-09-12 Fix for search order reference implementation
2022-09-15 Revert to Trute's table approach to call specific modes deliberately
2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone in this stage, use -48 throw, otherwise define a postpone action:

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt interprets, compiles or postpones the action of the thing, according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state, using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

AntonErtl [r1038] 2023-09-13 05:08:28

Making the translators depend on state is a bad idea. It means that everything using the translators becomes infected with this state-dependency. It also means that you cannot implement postpone or ]]...[[ as standard-compliant code (while, with state-independent translators you could).

Moreover, when you write a state-independent text interpreter, such as a polyForth-style text interpreter, or colorforth-bw, you would have to set state before executing the translators, which is perverse. And in the case of colorforth-bw, again there is no standard way to set state to get the translator to perform xt-post.

BerndPaysan [r1040] 2023-09-13 10:32:51

The experience with the usage in Gforth (non-standard extensions excluded) shows that direct calls to translators with a specific state are limited to postpone, which is compile-only and therefore

: postpone ( "name" -- )
  -2 state ! parse-name forth-recognize execute -1 state ! ; immediate compile-only

is not generating surprises (postpone is expected to leave the system in compilation state after it has done its work). In Gforth, ]] and [[ are implemented by changing state, and for recognizing the super-immediate [[ a special recognizer is added to the stack which returns a translator that has a specific postpone effect that changes back to compilation state and drops the additional recognizer from the stack.

' noop dup :noname  ] forth-recognizer stack> drop ; translate: translate-[[

The state-dependent invocation is the 99.9% case for translators, and that includes ]] and [[.

The Forth outer interpreter depends on state (or a similar internal representation). The object that deals with the different actions depending on state is the translator.

The proposal allows you to implement other ways to access the individual methods of a translator, if you need them. It no longer encourages using translators as building blocks for other translators, and we can add wording that only translators created by translate: are standard-conforming. Since there's little use for these other access methods, the proposal does not suggest standardizing them.

BerndPaysan New Version: minimalistic core API for recognizers [r1041] 2023-09-13 10:37:35

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation
2022-09-10 Typo fixed
2022-09-12 Fix for search order reference implementation
2022-09-15 Revert to Trute's table approach to call specific modes deliberately
2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt interprets, compiles or postpones the action of the thing, according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a translator.

"name" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state, using a system-specific way to determine the current mode.

Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE-independently, which only works on translators created by TRANSLATE: (e.g. for implementing POSTPONE), so any other way to define a translator is non-standard.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as xt1 .. xtn n.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token.

TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT

Translates a number.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Testing

TBD

ruv [r1042] 2023-09-13 17:02:48

Anton writes:

Making the translators depend on state is a bad idea.

It's a subject of terminology. A translator depends on state by definition:

to translate a token: to interpret the token if interpreting, or to compile the token if compiling.

token translator: a Forth definition that translates a token; also, depending on context, the execution token for this Forth definition.

If you want a recognizer to return not an execution token, but some opaque identifier, I suggest calling it a "descriptor".

token descriptor object: an implementation dependent data object (a set of information) that describes how to interpret and how to compile a token.

token descriptor: a value that identifies a token descriptor object; also, less formally and depending on context, a Forth definition that just returns this value (i.e., a constant), or a token descriptor object itself.

BerndPaysan [r1043] 2023-09-13 20:49:55

As said, this is just about moving things around. There's little difference whether you use translator-execute or execute on a translator as the specified way to get from the translator to its state-dependent action. It's all system-dependent and hidden: systems might implement it without even referring to STATE, only updating STATE to reflect compilation and interpretation state and otherwise never looking at it, and the way the system internally keeps its state can be completely different. The most obvious difference is that with translator-execute, you need another word.

The fact that some abstract data type is executable does not mean EXECUTE is the only way to operate on it. Recognizer sequences are executable in this proposal, and they still can be read out with and set by GET/SET-RECOGNIZER-SEQUENCE. So you can't just define them as colon definitions, you need to go through RECOGNIZER-SEQUENCE: to define them.

Though I don't propose to standardize this, the proposal also suggests making word list ids executable and putting them together in a recognizer sequence called search-order. Word list ids still have to be usable in other ways, e.g. to add new words to them, and the details are left to the system; but it is clear that they can't be normal colon definitions.

Not providing an abstraction like either translator-execute or execute, and instead putting state @ abs cells + @ execute directly into the outer interpreter, is a really bad idea, because all details of the reference implementation in which this sequence works then become part of the standard. Other ways to implement it, which may have performance advantages or which do not expose the postpone state in STATE, would then not be allowed. The following implementation should be standard, too:

: do-translate ( translate-body -- ) 0 + @ execute ;
: state! ( state -- ) dup state ! abs cells ['] do-translate >body cell+ ! ; \ assume threaded code
: translate: ( int-xt comp-xt post-xt "name" -- )
  create swap rot , , ,
  does> do-translate ;
: [ 0 state! ; immediate
: ] -1 state! ;
: ]] 2 cells ['] do-translate >body cell+ ! ; immediate \ STATE left as is

How to recognize [[ is left as exercise to the reader, hint: a recognizer is a good idea, because it actually provides something that is executed at postpone time.

AntonErtl [r1046] 2023-09-14 06:27:22

By comparison, with the first version of this proposal postpone can be implemented like this:

: postpone parse-name forth-recognize -2 swap execute ; immediate

which would not contain non-standard usage like -2 state !, and it would also work in interpret state (not the most important feature, but a feature nonetheless). And ]] could also be implemented as a standard program.

I don't want to restrict the usage of rectypes/translators to state-dependent outer interpreters. Other uses may be rare, but they exist, and people may come up with more over time if we make the interface flexible enough. The proposal does not propose to standardize state-independent ways to get at the functionality. Therefore, if the proposal is accepted, they don't exist for standard programs, and therefore they are not counterarguments against the disadvantages of the proposed state-dependent-only translators. The fact that this state-dependence means that you cannot use rectypes/translators to build other rectypes/translators is another (minor) argument against the state-dependence.

Concerning having a state-independent rectype as an abstract data type, the first version of this proposal proposed that rectype is an executable word with stack effect ( i*x state -- j*x ) where state would be 0, -1, or -2. This does not expose anything about the internals, and even allows defining rectypes without using a special defining word. The invocation in the text interpreter is ( i*x rectype ) state @ swap execute, and in postpone it's as shown above.

Alternatively, if the rectype is the address of some data structure, yes, we would need an additional word, maybe rectype-translate ( i*x rectype n -- j*x ) that performs the access to the data structure. The usage in the text interpreter would be ( i*x rectype ) state @ rectype-translate and the usage in postpone would be ( i*x rectype ) -2 rectype-translate.
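
The executable-rectype variant with stack effect ( i*x state -- j*x ) can be illustrated with a minimal sketch. This is not code from any version of the proposal: the names demo-lit, and demo-rectype-num are hypothetical, and the sketch only assumes the mode numbers 0, -1 and -2 mentioned above.

```forth
\ Hypothetical sketch: a rectype that is a plain colon definition and
\ receives the mode as an explicit argument
\ (0 = interpret, -1 = compile, -2 = postpone).
: demo-lit, ( n -- ) postpone literal ;

: demo-rectype-num ( n mode -- n | )
  case
     0 of                                endof  \ interpret: keep n on the stack
    -1 of  demo-lit,                     endof  \ compile: append n as a literal
    -2 of  demo-lit, postpone demo-lit,  endof  \ postpone: append code that compiles n later
  endcase ;

\ The caller chooses the mode; the rectype itself never reads STATE.
: demo-c7 ( -- ) 7 -1 demo-rectype-num ; immediate
: demo-t1 ( -- n ) demo-c7 ;   \ demo-c7 runs while demo-t1 is compiled
```

In this sketch a text interpreter would pass state @ as the mode, while a postpone-like word would pass -2 regardless of the current state.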

ruv [r1047] 2023-09-14 07:00:33

Bernd writes:

The most obvious difference is that with translator-execute, you need another word.

Yes, essentially I agree concerning translator-execute and execute alternatives.

Yet another difference is that with translator-execute the Forth text interpreter (the outer loop) has to know this additional word (which probably means a higher degree of coupling). With execute, it does not need to know any additional word.

to make word list ids executable [...] but it is clear that they can't be normal colon definitions

Another example is defer-words (words created by defer), which are executable but are not normal colon definitions — defer! and defer@ can be applied to their xt.

The following implementation should be standard, too

The provided implementation for ]] is system dependent, namely it depends on implementation of Recognizers API. But, anyway, Gforth's ]] can be implemented in a standard way via postpone.

BerndPaysan [r1048] 2023-09-14 07:42:43

A translator is the address of a data structure, which also happens to be executable. This is not a contradiction! And there was a proposed standard way to access fields directly, renamed from the Trute proposal (but with otherwise identical, value-field like semantics) to INTERPRET-TRANSLATOR, COMPILE-TRANSLATOR, and POSTPONE-TRANSLATOR. The reason I deleted these is that we don't even use them in Gforth, we only use >POSTPONE, which has a different effect (it does not read out the xts, it executes it right away). If there is consensus that this is the right interface (not a value-field, but a defer-field), I can add this back to the proposal; as well as adding a standard way to set the state without knowing the internals of the system, for which the file recognizer-ext.fs in Gforth also provides a suggestion:

: translate-state ( translator-access-xt -- )
    \ takes a translator access xt, and may check if that actually is one
    >body @ cell/ negate state ! ;

The hypothetical more performant implementation in Reply 1043 would have a different translate-state, which would contain something like

>body @ ['] do-translate >body cell+ !

and only change STATE for interpret/compile.

This proposal is minimalistic on purpose and does not cover all corner cases, especially not those where no consensus has been reached yet.

I consider the magic number dispatch method proposed earlier as not appropriate: this is tied to a specific implementation, and not a good interface. Method invocation or field access should be done by named access words, not by numbers.

ruv [r1049] 2023-09-14 08:00:39

Anton writes:

postpone can be implemented like this

postpone can be implemented in any variant of the Recognizer API, with more or less code.

A difference is whether the behavior of postpone can be extended/changed without redefinition of postpone.

My point: if users need to extend the behavior of postpone without redefinition, then a special method can be specified for that. OTOH, postpone (and ]]) is a poor man's "postponing mode". An example of a more convenient tool is my c-state PoC, which provides a better tool for users, and it even supports any new user-defined special words.

I don't want to restrict the usage of rectypes/translators to state-dependent outer interpreters.

It's not an argument, since the API can provide words like compile-token, execute-token, postpone-token, having ( i*x xt.translator -- j*x ) or ( i*x rectype -- j*x ), which are state-independent and don't restrict usage in the mentioned way.
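
A minimal sketch of such state-independent access words, assuming a translator is a three-cell table behind an executable word. The table layout and every demo-* name are assumptions made for illustration; only the names compile-token, execute-token and postpone-token are taken from the text above, and their implementation here is hypothetical.

```forth
\ Hypothetical sketch: translator = table of three xts (interpret, compile,
\ postpone) behind a CREATEd word, plus state-independent access words.
: demo-lit, ( n -- ) postpone literal ;

: demo-translate: ( xt-int xt-comp xt-post "name" -- )
  create  rot , swap , ,                  \ cells: interpret, compile, postpone
  does> ( i*x body -- j*x )
    state @ if  cell+  then  @ execute ;  \ state-dependent dispatch (no postpone state here)

\ State-independent access: pick a table cell explicitly.
: execute-token  ( i*x xt.translator -- j*x )  >body @ execute ;
: compile-token  ( i*x xt.translator -- j*x )  >body cell+ @ execute ;
: postpone-token ( i*x xt.translator -- j*x )  >body 2 cells + @ execute ;

:noname ( n -- n ) ;                             \ interpret: keep n
' demo-lit,                                      \ compile: append n as a literal
:noname ( n -- ) demo-lit, postpone demo-lit, ;  \ postpone: compile n later
demo-translate: demo-translate-num

\ POSTPONE-like use that does not touch STATE:
: demo-postpone-num ( n -- ) ['] demo-translate-num postpone-token ;
```

With this layout, executing the translator directly still dispatches on STATE, while the three access words select an action explicitly.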

BerndPaysan New Version: minimalistic core API for recognizers [r1080] 2023-09-15 07:51:05

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation
2022-09-10 Typo fixed
2022-09-12 Fix for search order reference implementation
2022-09-15 Revert to Trute's table approach to call specific modes deliberately
2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
2023-09-15 Add list of example recognizers and their names.

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a translator.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

RECOGNIZER-SEQUENCE: ( n*xt n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with the topmost xt on stack and proceeding towards the bottommost xt until successful.

SET-RECOGNIZER-SEQUENCE ( n*xt n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- n*xt n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as n*xt n.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token.

TRANSLATE-NUM ( n -- n | ) RECOGNIZER EXT

Translates a number.

TRANSLATE-DNUM ( d -- d | ) RECOGNIZER EXT

Translates a double number.

TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT

Translates a floating point number.

TRANSLATE-STRING ( addr u -- addr u | ) RECOGNIZER EXT

Translates a string.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Recognizer examples

REC-NT ( addr u -- nt translate-nt | notfound ) Search the locals wordlist if locals have been defined, and then the search order for a definition matching the string addr u, and provide that name token as result.

REC-NUM ( addr u -- n translate-num | d translate-dnum | notfound ) Try converting addr u into a number, and on success return either a single number n and translate-num, or a double number d and translate-dnum.

REC-FLOAT ( addr u -- r translate-float | notfound ) Try converting addr u into a floating point number, and on success return that number r and translate-float.

REC-STRING ( addr u "string"<"> -- addrs us translate-string | notfound "string"<"> ) Convert quoted strings (i.e. addr u starts with '"') in the input stream into string literals, performing the same escape handling as S\" and on success return the converted string as addrs us and translate-string.

REC-TICK ( addr u -- xt translate-num | notfound ) If addr u starts with a ` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.

REC-SCOPE ( addr u -- nt translate-nt | notfound ) Search for words in specified vocabularies (the vocabulary needs to be found in the current search order). The string addr u has the form vocabulary:name. Apart from specifying the vocabulary to be searched, REC-SCOPE is identical in effect to REC-NT.

REC-TO ( addr u -- xt n translate-to | notfound ) Handle the following syntax of TO-like operations of value-like words:

->value as TO value or IS value
+>value as +TO value
'>value as ADDR value
@>value as ACTION-OF value

xt is the execution token of the value found, n indexes which variant of a TO-like operation is meant, and translate-to is the corresponding translator.

REC-ENV ( addr u -- addrs us translate-env | notfound ) Takes a pattern in the form of ${name} and provides the name as addrs us on the stack. The corresponding translator translate-env is responsible for looking up that name in the operating system's environment variable array.

REC-COMPLEX ( addr u -- rr ri translate-complex | notfound ) Converts a pair of floating point numbers in the form of float1+float2i into a complex number on the stack, and returns translate-complex on success.

Testing

TBD

BerndPaysan New Version: minimalistic core API for recognizers [r1081] 2023-09-15 08:18:00

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation
2022-09-10 Typo fixed
2022-09-12 Fix for search order reference implementation
2022-09-15 Revert to Trute's table approach to call specific modes deliberately
2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
2023-09-15 Add list of example recognizers and their names.

Problem:

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating a special implementation

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: rec-xt ( addr u -- translator )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] notfound  THEN ;

then you should factor the part starting with state @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: rec-xt ( addr u -- ... translator )
  here place  here find dup IF  [']  translate-xt
  ELSE  drop ['] notfound  THEN ;

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):

REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )

XY.3 Additional usage requirements

XY.3.1 Translator

translator: subtype of xt, and executes with the following stack effect:

TRANSLATE-THING ( j*x i*x -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.

NOTFOUND ( -- ) RECOGNIZER

Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT

Create a translator word under the name "name". This word is the only standard way to define a translator.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

XY.6.2 Recognizer Extension Words

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

Rationale:

FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:

RECOGNIZER-SEQUENCE: ( n*xt n "name" -- ) RECOGNIZER EXT

SET-RECOGNIZER-SEQUENCE ( n*xt n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- n*xt n ) RECOGNIZER EXT

Obtain the recognizer sequence xt-seq as n*xt n.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token.

TRANSLATE-NUM ( n -- n | ) RECOGNIZER EXT

Translates a number.

TRANSLATE-DNUM ( d -- d | ) RECOGNIZER EXT

Translates a double number.

TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT

Translates a floating point number.

TRANSLATE-STRING ( addr u -- addr u | ) RECOGNIZER EXT

Translates a string.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognize execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator / notfound )
  forth-wordlist find-name-in dup IF  ['] translate-nt  ELSE  drop ['] notfound  THEN ;
: rec-num ( addr u -- n num-translator / notfound )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop ['] notfound  THEN ;

: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
  2>r 2r@ rec-nt dup ['] notfound = IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

Extensions reference implementation:

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP ['] NOTFOUND <> IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;

Once you have recognizer sequences, you shall define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

: find-name-in ( addr u wid -- nt / 0 )
  execute ['] notfound = IF  0  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Recognizer examples

REC-FLOAT ( addr u -- r translate-float | notfound ) Try converting addr u into a floating point number, and on success return that number r and translate-float.

REC-TO ( addr u -- xt n translate-to | notfound ) Handle the following syntax of TO-like operations of value-like words:

->value as TO value or IS value
+>value as +TO value
'>value as ADDR value
@>value as ACTION-OF value

xt is the execution token of the value found, n indexes which variant of a TO-like operation is meant, and translate-to is the corresponding translator.

Testing

TBD

BerndPaysan [r1082] 2023-09-15 14:03:42

Things to discuss, because there are still too many variables.

ToDo:

Rename Recognizers from REC-result to RECOGNIZE-result. A solution for .RECOGNIZERS drowning the reader in recognize- could be to skip that prefix, because all recognizers are supposed to have the same prefix, anyways.
Revert the name of translators to rectypes or some similar word showing that this does describe a type?
Add mode/state-specific access words to the translators again and decide on how they work. I prefer defer-field-like words, which execute the corresponding action right away instead of putting an xt on the stack for consumption. Defer-fields could work together with IS and ACTION-OF to access the xts within (in Gforth, they do).

Answers to some questions:

A lot of thoughts went into it to make different subsets of this proposal useful on their own, and allow different implementation strategies. The answer to “can I do without feature X” is most likely yes. You can use the subset of the features you want. Stripping away too much results in a subset no longer usable.

Opening up the whole idea to small systems is useful to gain wider use.
FORTH-RECOGNIZE is a deferred word in the reference implementation on purpose, and that allows changing it without adding more words. To add more implementation options, you can use the setter and getter words (which are optional) if you don't want to implement it as deferred word to swap in and out named sequences.
The recognizer sequences do have words to get and set the sequence, so you can just work with a single sequence and set/get it if you like. The nesting capability comes by the magical fact that a recognizer sequence has the same stack effect as a recognizer.
You can do without both, because recognizer sequences can be written as colon definitions “by hand”.
Named sequences are useful, especially when you swap in recognizer sequences for applications that do something completely different than the Forth recognizer sequence. If you do not want to support named sequences, you can still provide the one single named sequence FORTH-RECOGNIZE, and allow SET-RECOGNIZER-SEQUENCE and GET-RECOGNIZER-SEQUENCE to operate just on that. That's also an option where recognizers are useful without having FORTH-RECOGNIZE being deferred and no RECOGNIZER-SEQUENCE:.
The NOTFOUND return for failure is there so that you can always EXECUTE the result of FORTH-RECOGNIZE and don't have to check for errors there.
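
As an illustration of the point that a recognizer sequence can be written as a plain colon definition, here is a minimal self-contained sketch. The two toy recognizers, their placeholder translators and all demo-* names are hypothetical; only the convention of returning the xt of a not-found word on failure is taken from the proposal.

```forth
\ Hypothetical sketch: a two-element recognizer sequence written by hand.
: demo-notfound ( -- ) -13 throw ;
: demo-translate-digit ( n -- n ) ;     \ placeholder translator, interpret only
: demo-translate-flag  ( f -- f ) ;     \ placeholder translator, interpret only

\ Recognizes a single decimal digit.
: demo-rec-digit ( addr u -- n xt-digit | xt-notfound )
  1 <> if  drop ['] demo-notfound exit  then
  c@ [char] 0 -  dup 10 u< if  ['] demo-translate-digit exit  then
  drop ['] demo-notfound ;

\ Recognizes the lexeme "yes".
: demo-rec-yes ( addr u -- true xt-flag | xt-notfound )
  s" yes" compare 0= if  true ['] demo-translate-flag
  else  ['] demo-notfound  then ;

\ The sequence: try demo-rec-digit first, fall back to demo-rec-yes.
\ It has the same stack effect as a single recognizer, so it can be nested.
: demo-recognize ( addr u -- i*x xt | xt-notfound )
  2dup 2>r demo-rec-digit
  dup ['] demo-notfound = if  drop 2r> demo-rec-yes  else  2r> 2drop  then ;
```

Because demo-recognize always returns an executable xt, a caller can simply EXECUTE the result, and a failed recognition ends in -13 THROW.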

Tough question: The string recognizer has a side effect, which is not good. Moving that side effect to the translator causes other problems, because TRANSLATE-STRING no longer has the corresponding string on the stack, but needs to parse it later. Actually, parsing should happen in PARSE-NAME. It still seems to be a hack that doesn't have a perfect solution.

ruv [r1084] 2023-09-16 00:38:53

Rename Recognizers from REC-result to RECOGNIZE-result

In general, an abbreviation or acronym may be acceptable to me. But in this case I prefer RECOGNIZE- rather than REC-. The main disadvantage of rec is that it has misleading associations. And the main advantage of recognize is that it's a whole English word that is very appropriate for our case.

The part referred as "result" should not be a result (of recognizing), but the expected type of the input lexeme. Have a look in your examples — REC-NUM and REC-TICK produce the same result type translate-num, but they accept different types of input lexemes, and these types are identified by NUM and TICK symbols correspondingly.

Thus, the naming form for recognizers can be expressed as RECOGNIZE-{lexeme-type-symbol}.

Revert the name of translators to rectypes or some similar word showing that this does describe a type?

It does describe a type of what? It describes a type of a token i*x, which is a result of recognizing. Actually, a token translator identifies the type of a token i*x, which is a result of recognizing. So a token translator is at the same time a token type.

If we want to reflect this idea, we can use the acronym tt, which stands for both: token translator and token type. Then, token translators can be named according to the form TT-{token-type-symbol}. It looks elegant to me.

The names of translators are used for two purposes: to call a translator (for example, when we define a new translator via existing translators), and to obtain the xt of a translator (which is at the same time an identifier for a token type) in order to analyze a result of recognizing. The prefix tt- looks good in both of these cases.

ruv [r1085] 2023-09-16 01:08:02

An example of using translators for two different purposes:

\ use "tt-lit" and "tt-2lit" just to call these token translators:

: tt-3lit ( 3*x -- 3*x | )
  >r tt-2lit  r> tt-lit
;

: recognize-forth-lexeme ( sd -- i*x tt ) forth-recognizer execute ;


\ use "tt-xt" to analyze a token type:

: recognize-tick ( sd -- xt tt.xt | 0 )
  "'" match-head 0= if 2drop 0 exit then  ( sd2 ) \ the input lexeme without the leading tick
  ['] recognize-forth-lexeme execute-balance2 ( i*x tt|0 n.data-stack n.float-stack )
  2>r dup ['] tt-xt = if 2rdrop exit then drop 2r> fndrop ndrop 0
;

In this implementation for recognize-tick (not tested), the phrase 'foo::bar::baz will work correctly and returns xt of the word baz in the wordlist bar in the wordlist foo, when recognize-pqname for the syntax "::" (example) is a part of forth-recognizer.

To implement this, we make a nested call of the Forth recognizer for another lexeme and then analyze the returned type. If the returned type is not appropriate, we drop the token (from the data stack, and from the floating-point stack, if any). So we need to be sure that calling recognize-forth-lexeme never causes any side effect (other than on the stacks), even when recognizing succeeds.

NB: when recognize-tick is a part of the current forth-recognizer, executing recognize-forth-lexeme on some inputs will produce an indirectly recursive call of recognize-forth-lexeme (as intended).

ruv [r1086] 2023-09-16 02:12:54

Tough question: The string recognizer has a side effect, which is not good. Moving that side effect to the translator is causing other problems, because TRANSLATE-STRING no longer has the corresponding string on the stack, but needs parsing it later.

It's perfectly acceptable for a translator to parse the input buffer and/or read the input stream. Some token translators will even make nested calls of the Forth text interpreter and can throw exceptions.
The problem that part of the string can be in the input buffer (or even in the input stream) is solved by introducing two translators for strings: one accepts the full string from the stack (e.g. tt-slit), and another (e.g. tt-slit-parsing) accepts the starting part from the stack and the tail from the input buffer (or input stream). The string recognizer returns one or the other depending on whether a lexeme is a complete string or only the start of a string.

I published a reference implementation in 2019, and now updated it for the current proposal.

A string recognizer can be as follows:

: quot ( -- sd.quot ) s\" \"" ;

: recognize-string ( sd.lexeme -- sd tt.slit|tt.slit-parsing | 0 )
  quot match-head 0= if 2drop 0 exit then quot match-tail if ['] tt-slit exit then
  2dup quot contains if 2drop 0 exit then \ fail if '"' is found in the middle of the string
  ['] tt-slit-parsing
;

BerndPaysan [r1087] 2023-09-16 21:32:13

The code which I have simply looks like this:

['] translate-string of  json-string!           endof

Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.

Thinking a bit more about that, I found:

The thing you want to nest is the translator for names
As I said, names should be first, numbers second and the rest third
We have nestable recognizer sequences

So one solution would be to put all recognizers that return nts+translate-nt (or variants of that, e.g. locals have a variant of translate-nt that differs for postpone) in one recognizer stack, which has a name, and can be called without calling the entire recognizer stack. These recognizers have now a predictable effect, and no side effect. Since you can't tick locals, you still have to check for translate-nt, but that's ok. You don't have to go through all weird other recognizers.

In Gforth, .recognizers now can handle and display nested recognizers, and if you split this up like that, it would output:

.recognizers  ~names ( ~nt ( Forth Forth Root ) ~scope ) ~numbers ( ~num ~float ) ~others ( ~string ~to ~dtick ~tick ~body ~complex ~env ~meta )

The ~ is there to abbreviate recognize- (or rec- now).

This also makes it easier to add recognizers where they belong, e.g. when you add the scope recognizer, you just push it to the end of the names recognizer stack. If you add the floating point recognizer, the complex recognizer (both are numbers), or the hex floating point recognizer for exact notation of floating point constants, you just push them to the back of the numbers recognizer stack, and they get ahead of the others. I like this solution.

The other solution is what Gforth does: There's a ?REC-NT which does the nesting, the checking for translate-nt, and the cleaning up of the side effects (stacks and >IN). There is the possibility to make this more generic, e.g. create a word TRY-RECOGNIZE which gets an xt, passes that to the result, and if that returns false, everything is cleaned up and false is returned, otherwise whatever that xt left (including the flag) is returned.

The cleaning up is already cumbersome, because a variable number of values can be returned on both data and floating point stack, and when in addition to that also >IN can change, it's just a little bit more hassle.

ruv [r1088] 2023-09-17 16:31:14

One correction. I wrote:

If we want to reflect this idea, we can use the acronym tt, which stands for both: token translator and token type.

It should be read as:

If we want to reflect this idea, we can use the acronym tt, which stands for both: "translate token" (verb) and "token type" (noun).

Data type symbol

To specify formal requirements, we have to introduce a new data type for token translators, which is a subtype of xt. And the abbreviation tt is a good candidate for this data type symbol.

If we have the data type tt => xt|0, and the symbol sd for the string data type, the naming convention along with the stack diagram for a recognizer can be expressed as:

RECOGNIZE-{lexeme-type-symbol} ( sd.lexeme -- i*x tt ) ( F: -- j*r )

ruv [r1089] 2023-09-17 23:29:13

@BerndPaysan writes:

Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.

This would be a valid argument if it were possible to return something useful from a recognizer in all use cases but single-line string literals. But it's impossible.

For example, a recognizer for multi-line string literals cannot parse the full string literal without refilling the source (see my PoC implementation). Should we also restore the input source state to isolate side effects of recognizers?

And it still isn't enough. A recognizer for curly-based markup like foo{ any forth code bar{ nested code }bar ... }foo cannot return something useful, since a useful thing in this case is a created definition or just a side effect of appending some semantics to the current definition. Should we also restore the state of the dictionary?

I think, it's obvious — isolation of all possible side effects of recognizers is not fruitful.

Yes, some recognizers return objects that are not useful by themselves, but they still return information about what a given lexeme means, and that is an acceptable price for the absence of side effects in all recognizers.

Also, we separate concerns into things that do have side effects (token translators) and things that don't have side effects (recognizers). It's a very useful separation.

ruv [r1090] 2023-09-17 23:52:08

@BerndPaysan writes:

The code which I have simply looks like this:
['] translate-string of  json-string!           endof

A straightforward solution is to handle each token type of string literals separately. Probably, I would write it as follows:

  'tt-slit           of                  json-string!    endof
  'tt-slit-parsing   of  parse-slit-end  json-string!    endof
  'tt-slit-ml        of  parse-slit-ml   json-string!    endof

(I would use a recognizer for a leading tick, and naming of translators in the form tt-{token-type-symbol})

Or I would factor a helper word as follows:

: ?prepare-tt-slit ( i*x tt -- i*x tt | sd.transient tt.slit )
  case
    'tt-slit           of                  'tt-slit endof
    'tt-slit-parsing   of  parse-slit-end  'tt-slit endof
    'tt-slit-ml        of  parse-slit-ml   'tt-slit endof
    dup  \ no match: keep tt, since endcase drops the selector
  endcase
;

: eval-json ( .. tag -- )
  ?prepare-tt-slit case
    ...
    'tt-slit           of                  json-string!    endof
    ...
  endcase
;

ruv [r1091] 2023-09-18 01:21:10

Multiple entry points for the Forth recognizer

@BerndPaysan writes:

This also makes it easier to add recognizers where they belong, e.g. when you add the scope recognizer, you just push it to the end of the names recognizer stack. If you add the floating point recognizer, the complex recognizer (both are numbers), or the hex floating point recognizer for exact notation of floating point constants, you just push them to the back of the numbers recognizer stack, and they get ahead of the others. I like this solution.

Yes, I have also considered such a solution. It's a convenient way to implement the default Forth recognizer.

But requiring the Forth recognizer to always conform to this particular structure of recognizer sequences, and even to always be the same instance of this structure, is too restrictive.

And otherwise you don't know the id of the actual names recognizer sequence (and don't even know whether such a sequence exists), and so you cannot check a lexeme against only this sequence (I mean, in the implementation of recognize-tick).

Filter recognizer results

Bernd, your word try-recognize is a good factor for filtering results, regardless of side effects (beyond stacks). With recognizers that have no side effects, it can also be implemented in a portable way.

If this word filters for a single token type, it's better to pass a corresponding tt directly (instead of xt.filter).

If this word allows filtering for multiple token types (I assume this variant), it should not drop tt from the stack.

Also, to be more useful, this word should not be bound to the current Forth recognizer only. Then, this word can be called as

apply-recognizer-filter ( sd.lexeme xt.recognizer xt.filter -- i*x tt | 0 )

A usage example:

: recognize-forth-name ( sd.lexeme -- nt tt.nt | 0 )
  forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter
;

: find-forth-name ( sd.lexeme -- nt | 0 )
  forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter  if exit then 0
;

: find-forth-name? ( sd.lexeme -- nt true | false )
  forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter  0<>
;

: recognize-tick ( sd.lexeme -- xt tt.xt | 0 )
  "'" match-head 0= if 2drop 0 exit then  ( sd2 ) \ the input lexeme without the leading tick
  forth-recognizer [: dup 'tt-xt = ;] apply-recognizer-filter
;
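
A minimal sketch of how apply-recognizer-filter might be implemented portably is given below. It assumes that recognizers have no side effects beyond the data stack, that the filter has the stack effect ( tt -- tt flag ) as in the examples above, and it ignores the floating-point stack for brevity. The helper drop-to-depth is a hypothetical name introduced only for this sketch.

```forth
\ Hypothetical helper: drop items until n items remain beneath n.
: drop-to-depth ( i*x n -- j*x )
  begin  depth 1- over >  while  nip  repeat  drop ;

: apply-recognizer-filter ( sd.lexeme xt.recognizer xt.filter -- i*x tt | 0 )
  >r >r                          ( sd.lexeme )  ( R: xt.filter xt.recognizer )
  depth 2 -                      \ stack depth beneath the lexeme
  r> swap >r                     ( sd.lexeme xt.recognizer )  ( R: xt.filter n )
  execute                        ( i*x tt | 0 )
  dup 0= if  2r> 2drop  exit  then   \ not recognized: pass the 0 through
  2r> swap >r                    ( i*x tt xt.filter )  ( R: n )
  execute                        ( i*x tt flag )
  if  r> drop  exit  then        \ accepted: keep the qualified token
  r> drop-to-depth  0 ;          \ rejected: discard the token, report failure
```

Recording the depth before calling the recognizer is what allows a token with an unknown number of cells to be discarded. A complete version would have to do the same for the floating-point stack with FDEPTH.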

The cleaning up is already cumbersome, because a variable number of values can be returned on both data and floating point stack, and when in addition to that also >IN can change, it's just a little bit more hassle.

Yes, but, as I have shown, >in is not enough. Also, it's better to avoid such special cases in general.

ruv [r1112] 2023-09-29 08:06:21

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.

There is no way for a program to check whether it can apply TO to FORTH-RECOGNIZER, or FORTH-RECOGNIZE, or RECOGNIZE-FORTH-LEXEME, etc. Thus, TO cannot be optional. And it cannot be mandatory either. Thus, TO cannot be a part of the API at all, neither of RECOGNIZER nor of RECOGNIZER EXT.

Then the getter and setter should be a mandatory part of the API.

ruv [r1113] 2023-10-01 22:54:20

In continuation to the message:

Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.

This would be a valid argument if it were possible to return something useful from a recognizer in all use cases

Another example of an unuseful token is the result of the mentioned recognizer REC-TO, which recognizes a syntax like ->foo.

It's too restrictive to require this token to be ( xt n tt ), since in some systems it can be just ( xt.set-value tt ), and in others ( addr.data-field xt.store tt ). This means that this token is not something useful to a program at all (apart from translation).

GeraldWodni [r1325] 2024-09-26 13:55:18

The committee thanks the authors for all the work. Here is the timetable:

Everybody interested in this proposal: please submit your comments by end of October.
Bernd (main author): please work this into a new version by the end of the year (2024).
The committee will have a special interim meeting for this very proposal in February (final date will be announced in mattermost)

Considered

BerndPaysan [r1332] 2024-09-26 15:33:13

Concerning the setters and getters: I would prefer to make it mandatory that FORTH-RECOGNIZE actually is a deferred word, and drop the additional getters and setters completely. DEFER, IS, and ACTION-OF are all CORE EXT; so if you implement the recognizers, you have a dependency on those. The previous proposals had VALUE and TO as the interface, which are also CORE EXT.

Gforth could support IS and ACTION-OF on recognizer sequences, too (i.e. assign n elements in order), through its polymorphous approach at all those words for value-style words (TO, +TO, ADDR, IS, ACTION-OF all can do different things on different classes of values), but I guess that would be too much.

Can those setters and getters be optional in case you don't want to support DEFER, and how can a program be written to work in both cases? If you have TOOLS EXT available, you can use

[DEFINED] is [IF]
    is forth-recognize
[ELSE]
    [DEFINED] to [DEFINED] forth-recognizer and [IF]
        to forth-recognizer
    [ELSE]
        set-forth-recognizer
    [THEN]
[THEN]

Yes, this is ugly and shows that having different options is not a good idea.

For the reworked proposal, I will need to restructure the proposal in a way that optional parts I rather want to remove are outlined as such, so that the final rewrite is easy.

ruv [r1351] 2024-10-08 08:43:25

Deferred words in API considered harmful

make it mandatory that FORTH-RECOGNIZE actually is a deferred word

As we have discussed, the main problem with a deferred word is that it can't be redefined by wrappers that have additional actions when setting or getting the value. In this respect, such a word in an API is as bad as an address-flavoured variable (like BASE).

There is also a recent discussion in comp.lang.forth (link) under subjects "value-flavoured approach" and "value-flavoured structures".

Special data object on failure considered harmful

A question is what to return on failure (unsuccess): a special data object (xt of notfound) or a common data object 0 (zero).

Below is a copy of my rationale from 2023, with some rewording.

There are two strong arguments against a special data object:

consistency with other similar words;
impact on the overall lexical size of programs.

Consistency

Many standard words return some data object on success, or 0 (zero) on failure. This is possible because this data object cannot be 0.

For example:

name>interpret ( nt -- xt | 0 )
find-name ( sd.name -- nt | 0 )
find-name-in ( sd.name wid -- nt | 0 )
find ( c-addr -- xt n | c-addr 0 )
search-wordlist ( sd.name wid -- xt n | 0 )
source-id ( -- fileid | -1 | 0 ) — not a fail, but also an example when zero was chosen instead of a special object.

Also, it is a common approach in practice. This allows common higher-order functions to operate on the common failure result 0.

Why should recognizers not follow this practice? Why should they return a special id on failure rather than zero?

Lexical code size

Returning notfound on failure makes the code shorter (in terms of lexemes) in some places. But the point is that it makes code longer in more places.

I checked the source code of Gforth (as of 2023-09-17), which includes both the implementation and usage of a Recognizer API. In its code:

['] notfound with = or <> is used 10 times, and without checking — 32 times.
forth-recognize execute is used 3 times.

If we use 0 (zero) instead of the notfound xt, then:

['] notfound <> is removed 5 times, which eliminates 15 lexemes;
['] notfound = is replaced with 0= 5 times, which eliminates 10 lexemes;
['] notfound is replaced with 0 32 times, which eliminates 32 lexemes;
the definition for notfound is removed, a definition for ?found is added: : ?found ( x.some\0 -- x.some | 0 -- never ) dup 0= -13 and throw ;, which adds not more than +3 lexemes;
forth-recognize execute is replaced with forth-recognize ?found execute 3 times, which adds +3 lexemes;
the word ?found can be also used after find, search-wordlist, find-name, find-name-in — when the user needs to execute their result at once, and unsuccess should produce an exception.
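
As an illustration of the last point, here is a small sketch that uses ?found after search-wordlist. The word execute-name is a hypothetical name chosen for this example; -13 is the standard throw code for "undefined word".

```forth
: ?found ( x -- x )  dup 0= -13 and throw ;

\ Execute the word named by a string, or throw -13 if it is not found.
\ On success search-wordlist leaves ( xt n ), so ?found checks n.
: execute-name ( i*x c-addr u -- j*x )
  forth-wordlist search-wordlist  ?found  drop execute ;

\ The name is given in upper case, which every standard system accepts.
: twice ( n -- n n )  s" DUP" execute-name ;
```

An unknown name then surfaces as an exception that the caller can handle with catch, rather than as a special value that must be compared against at every call site.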

Thus, replacing notfound with zero reduces the overall lexical code size in Gforth by more than 51 lexemes, which is more than 0.4KiB in absolute size (as of 2023-09-17).

So why should we prefer an approach that increases the overall lexical size of programs?

AntonErtl [r1352] 2024-10-31 18:45:59

About the proposal text

The "Problem" section does not describe a problem of Forth-2012 that the proposal wants to solve, but considers a problem with some other recognizer proposal. Similarly, the "Solution" section refers to some other recognizer proposal. This makes these sections useless for readers who have not first read up on the other proposal, which is not even linked here. Parts of the "Solution" section might be useful in another section on transitioning from the earlier proposal.

Instead, the "Problem" and "Solution" sections should describe what benefits this proposal adds to the standard, and how. A possible "Discussion" section and its subsections should describe the benefits of the present approach over possible alternative approaches (if that's too detailed, lazy system implementors will complain about the length of the proposal, but some complaints should just be ignored).

"Typical use" should of course be presented.

State-dependence

The proposal in its present form is unacceptable to me because it defines a defining word TRANSLATE: for state-dependent words, and expects recognizers to produce the xt of state-dependent words. This makes the translators hard to use anywhere except in INTERPRET; the proposed-for-standard interface is even hard (actually impossible with standard means) to use in POSTPONE, which is an intended user of translators, as the proposal admits itself:

POSTPONE can do that without a standardized way

Another problem with the state-dependent translators is that it leads to either handwaving specifications of what they do, as evidenced in XY.3.1:

TRANSLATE-THING ( jx ix -- k*x )

A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.

in the non-specification of what translator-xt does in FORTH-RECOGNIZE and the handwaving specification of "name:" in TRANSLATE:

"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.

and the nonspecification of what TRANSLATE-NT, TRANSLATE-NUM, TRANSLATE-DNUM, TRANSLATE-FLOAT, and TRANSLATE-STRING do.

Or if you specify exactly what happens, it leads to lengthy texts that explain the state-dependence, and the three different cases. And you cannot even specify when xt-post is performed, because there is no "postpone state" in the standard. On the contrary the current document specifies that STATE is either 0 (interpretation state) or non-zero (compilation state), without any values left for a postpone state, and specifies only words for getting into interpretation state and compilation state, not postpone state.

If you really believe that the state-dependent approach is a good idea, please specify all these words exactly; the editor won't do it for you.

Opaque solution

If there is no need to make POSTPONE implementable in a standardized way, there is no need to make INTERPRET (which is not even standardized) implementable in a standardized way, either, and the translators can become a completely opaque thing that the standard does not document. In that case there is also no need for the translators to actually be executable. The recognizer could return an opaque translation token, and standard programs can only use that for implementing recognizers, but not for implementing text interpreters, POSTPONE, or anything else.

Transparent solutions

Alternatively, we might heed "Don't bury your tools!" and have a more useful interface for translators, like what we have seen in earlier drafts and other recognizer proposals.

POSTPONE

If the idea of the proposal is that xt-post is actually used by POSTPONE, the proposal should specify the change to POSTPONE.

Standardize recognizers

I expect that more people will want to compose existing recognizers into recognizer sequences than to define new recognizers, but they usually need to know about existing recognizers in order to do that. Therefore the proposal (or an accompanying proposal) should not just propose standard translators, but also standard recognizers.

ruv [r1358] 2024-11-02 22:37:33

@AntonErtl writes:

The proposal in its present form is unacceptable to me because it defines a defining word TRANSLATE: for state-dependent words, and expects recognizers to produce the xt of state-dependent words.

I do not like TRANSLATE: either, but for a different reason. Sometimes it is very convenient to define a translator as a quotation (right inside the recognizer), and if you are forced to define a translator only with TRANSLATE:, you cannot define it as a quotation.

This makes the translators hard to use anywhere except in INTERPRET;

Could you provide some examples, please? It seems, this is not harder than performing the observable interpretation semantics using the result of name>interpret.

BerndPaysan [r1359] 2024-11-04 01:16:31

Concerning explicit access methods to xt-int/xt-comp/xt-post, I can offer the following compromise, as a result of observations made:

It turns out that you can not access xt-int and xt-comp by setting STATE, executing the translator, and then reverting STATE to the value before, because words can change STATE as part of their interpretation or compilation semantics, and in that case, the state change is a desired result of performing interpretation or compilation semantics.

However, it turns out that you can access xt-post that way, because the only word that possibly changes that state is [[, and that token a) is not visible at all to POSTPONE, and b) changes the state back to compilation state, the state POSTPONE was in anyhow.

So if your system allows full explicit access to all three possible states, all translators have to be defined by 'TRANSLATE:', and I can offer you three access methods. If you only want to implement POSTPONE, the following definition actually works:

: postpone ( "string" -- )
  parse-name forth-recognize ?found
  state @ >r -2 state ! execute r> state ! ; immediate

Further observations:

Gforth has >INTERPRET and >COMPILE, and doesn't use them, only >POSTPONE is used. In exactly one place, in POSTPONE. All other invocations are through EXECUTE or only taking the data. The rest is implementation, including the extension towards more of those access methods for more, user-defined states. The question is whether you need to standardize a tool that has no use case, even if you don't bury it.

A possible way to deal with this is to move this out to a separate proposal.

What has been quite useful is the EXECUTE interface for user-written interpreters, because these are interpret-only, and don't need the complication of state-dependent translators at all.

BerndPaysan [r1361] 2024-11-04 11:21:24

Sleeping over it added a few ideas:

The invocation through changing STATE and restoring it works (in general) for translators that will definitely not change STATE as part of their own operation, e.g. translators for literals. It also works (as a special case) for POSTPONE, so a standard implementation of POSTPONE using that method is possible. The postpone mode itself, which needs to change STATE at [[ relies on the dispatch through STATE without setting and restoring STATE around the invocation, so it also works.

The question here is not if that implementation is a quality implementation, but whether it's not so bad that it is another bag full of inconsistencies. IMHO, TRANSLATE-NT will have demonstrable inconsistencies when not using the clean TRANSLATE: interface, but combined literal translators won't. For the cleaner interface outside of POSTPONE itself (which is special case enough to not require the cleaner interface), we have to demonstrate that there is an actual use case. So far, we don't have one.

Both POSTPONE with the additional functionality and the postpone mode ]] … [[ will become part of the proposal.

ruv [r1363] 2024-11-05 01:26:19

@BerndPaysan writes:

It turns out that you can not access xt-int and xt-comp by setting STATE, executing the translator, and then reverting STATE to the value before, because words can change STATE as part of their interpretation or compilation semantics, and in that case, the state change is a desired result of performing interpretation or compilation semantics.

This is wrong. Yes, the state change can be a desired result of interpretation or compilation semantics, but this does not prevent us from performing the interpretation or compilation semantics regardless of the initial value of STATE, as I have shown many times.

We can use the following helpers for that.

\ Useful factors
: compilation ( -- flag )  state @ 0<> ;
: enter-compilation ( comp: false -- true  |  comp: true -- true )  ] ;
: leave-compilation ( comp: true -- false  |  comp: false -- false )  postpone [ ;

\ For the execution semantics identified by xt,
\ perform the part that can be observed in interpreted state.
: execute-interpreting ( i*x xt -- j*x )
  compilation 0= if execute exit then
  leave-compilation execute enter-compilation
;

\ For the execution semantics identified by xt,
\ perform the part that can be observed in compilation state.
: execute-compiling ( i*x xt -- j*x )
  compilation if execute exit then
  enter-compilation execute leave-compilation
;

If we have a recognition result with the xt of a translator on top (i.e., a fully qualified token), and we want to perform the corresponding interpretation semantics regardless of the current value of STATE, we should execute this xt with execute-interpreting. If we want to perform the corresponding compilation semantics regardless of the current value of STATE, we should execute it with execute-compiling. If we want to perform the semantics according to STATE, we should just execute this xt with execute.

The key point in the implementation of execute-interpreting and execute-compiling is that we do not save/restore STATE if it already matches the semantics we want to perform, so if changing STATE is part of the semantics, STATE will be changed. On the other hand, if STATE does not match the semantics we want to perform, we change STATE and then restore it; if changing STATE is part of the semantics, it will change STATE to the same value that was saved and that we restore it to. Thus, the resulting STATE will be as expected!

NB: execute-interpreting and execute-compiling are also required if we want to perform the interpretation semantics or compilation semantics from an nt, regardless of the current value of STATE. Moreover, these words are required even in the old approach for Recognizer API, which provides the words RECTYPE>INT and RECTYPE>COMP, because these words have the same flaw for state-dependent words as NAME>INTERPRET and NAME>COMPILE.
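
A small self-contained demonstration of these helpers, as a sketch: the definitions are repeated from above so that the block loads on its own, and the words probe and demo are hypothetical names introduced only for this example.

```forth
: compilation ( -- flag )  state @ 0<> ;
: enter-compilation ( -- )  ] ;
: leave-compilation ( -- )  postpone [ ;

: execute-interpreting ( i*x xt -- j*x )
  compilation 0= if execute exit then
  leave-compilation execute enter-compilation ;

: execute-compiling ( i*x xt -- j*x )
  compilation if execute exit then
  enter-compilation execute leave-compilation ;

\ Hypothetical state-dependent word: reports the state it observed.
: probe ( -- flag )  state @ 0<> ;

\ Run probe under both regimes; intended to be called while interpreting.
: demo ( -- flag.compiling flag.interpreting )
  ['] probe execute-compiling       \ probe sees compilation state
  ['] probe execute-interpreting ;  \ probe sees interpretation state
```

When demo is executed from the interpreter, the first flag is true and the second is false, and STATE is back in interpretation state afterwards.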

ruv [r1364] 2024-11-05 03:29:18

@AntonErtl writes about token translators:

if you specify exactly what happens, it leads to lengthy texts that explain the state-dependence, and the three different cases. And you cannot even specify when xt-post is performed, because there is no "postpone state" in the standard. On the contrary the current document specifies that STATE is either 0 (interpretation state) or non-zero (compilation state), without any values left for a postpone state, and specifies only words for getting into interpretation state and compilation state, not postpone state.

This is reasonable. And we also discussed in the Recognizer chat group that the standard does not imply such a state as postponing (for the Forth text interpreter).

In my opinion, these problems can be avoided.

We should specify "to translate a token" and "token translator" in the common sections of term definitions, data types and usage requirements. Then, we do not need to repeat that for every token translator. It will be enough to specify that a word is a token translator, and the data type of the token (that it translates).
We can have a word like postpone-token ( qt -- ) that appends the compilation semantics of a lexeme, which was recognized as qt, to the current definition. (qt is a qualified token, which is a pair of an unqualified token and token translator ( uq tt ))

So, any additional state, if any, is encapsulated into postpone-token. The standard should not specify it.

Thus, postpone can be defined like this (in my parlance):

: postpone ( "name" -- )
  parse-lexeme perceive ?found postpone-token
;

How postpone-token finds/performs the postponing action from tt — it's an internal problem of implementation. The word postpone-token should throw the exception -32 "invalid name argument" if a postponing action is not associated with tt.

We need to provide a way to associate a postponing action (an xt) with a tt, or to create a new tt from an xt and tt. The postponing action should be optional. The user needs to provide a postponing action only if they want to make postpone applicable to the corresponding lexemes.

For example, we can have an optional word postponable ( tt1 xt.postponing -- tt2 ). Probably, this word shall return the same tt2 for the same input pair ( tt1 xt.postponing ). This word is optional, because it can be implemented along with postpone-token in a standard program, and postpone can then be redefined to use them.

ruv [r1369] 2024-11-13 09:32:32

@AntonErtl writes:

In that case there is also no need for the translators to actually be executable. The recognizer could return an opaque translation token,

I researched this approach.

In general, a recognizer returns a qualified token (qt) on success, where a qualified token is a pair of an unqualified token (uq) and a token descriptor (td).

Data type relations:

unqualified token: ut => ( S: i*x F: k*r )
token descriptor: td => x\0
qualified token: qt => ( ut td )

It is always possible to define a word translate-qtoken ( any qt -- any ), which translates a qualified token (i.e., performs the interpretation or compilation semantics for the corresponding recognized lexeme depending on STATE). And as practice shows, it is very useful and in demand.

Additionally, in Forth, it is always technically possible to make the token descriptor also a token translator (that is a subtype of the execution token), without any loss (see an example).

token translator: tt => xt ; td = tt

So, instead of using a separate word translate-qtoken, we can use the word execute. And the Forth text interpreter simply executes the token translator (instead of applying translate-qtoken to qt). Note that regardless of whether the token descriptor is a subtype of the execution token, the token descriptor is opaque for the Forth text interpreter. The only difference is whether translate-qtoken or execute is used by the Forth text interpreter.

The big advantage of token translators is that they can be defined inline as quotations, and they can be used to define other token translators. This simplifies programs and reduces the lexical size of programs.

Also, token translators allow us to define dual-semantics words more simply. For example, this is a definition for ['], which has the expected interpretation semantics:

: ['] ( -- xt | ) '  tt-xt ; immediate

See also in my gist the word missing(, which has the expected interpretation and compilation semantics. Without token translators such words are more difficult to implement.
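
To make the idea concrete, here is a sketch of what a token translator such as tt-xt could look like. The definition of tt-xt is an assumption made for this example, not the one from the reference implementation, and the word tick' is a hypothetical name used to avoid redefining the standard ['].

```forth
\ Assumed definition of a token translator for execution tokens:
\ leave the xt when interpreting, compile it as a literal when compiling.
: tt-xt ( xt -- xt | )  state @ if  postpone literal  then ;

\ When the token descriptor is itself a translator, translating a
\ qualified token is nothing more than EXECUTE.
: translate-qtoken ( i*x tt -- j*x )  execute ;

\ A dual-semantics tick built on the translator.
: tick' ( "name" -- xt | )  '  tt-xt ; immediate

\ Compiling use: the xt of dup is compiled into get-dup as a literal.
: get-dup ( -- xt )  tick' dup ;
```

Interpreting `tick' dup` leaves the xt of dup on the stack, and executing get-dup leaves the same xt, so one definition covers both cases.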

AntonErtl [r1378] 2024-11-29 18:18:40

Concerning the supposed lack of use cases: I have mentioned use cases where the state-based interface is at the very least cumbersome in r1038.

state is a bad idea, as demonstrated by the problems mentioned above. We are stuck with it for the existing system, but we must not put state in new interfaces, much less in new defining words.

As for opaque vs. transparent: Opaque would only be an option if the only use of translators was really in the text interpreter and in postpone. But if we want to support other use cases (and there are other use cases, as discussed above), we should do a transparent user interface. And it must not be state-dependent.

ruv [r1385] 2024-12-01 17:52:37

Use cases

Concerning the supposed lack of use cases: I have mentioned use cases where the state-based interface is at the very least cumbersome in r1038.

Do you mean this example: "you cannot implement postpone or ]]...[[ as standard-compliant code"? Then it's unclear which specification of these words you cannot implement, because:

You provided a portable (standard compliant) implementation for ]] ... [[ (based on postpone). This implementation does not depend on how postpone is implemented.
A standard postpone can be implemented using find or find-name. An advanced postpone can be also implemented in a standard-compliant way.

Could you please clarify?

Probably you mean that the user should be able to create a recognizer and assign it to the perceptor, and then postpone (and ]] ... [[ that uses this postpone) should be applicable to lexemes that this recognizer recognizes. But I do not see any connection to the state-based interface here either.

state is a bad idea, as demonstrated by the problems mentioned above.

I implemented postpone in four different approaches (see fep-recognizer/implementation/variant.gamma/postpone/index.fth) in my "gamma" reference implementation for Recognizer API.

This reference implementation is portable and can be loaded in Gforth as

gforth implementation/index.fth

In every approach I defined the interpretation semantics for postpone, so postpone depends on state. In every approach the words compile-postpone-qtoken ( qt -- ) and translate-postpone-qtoken ( any qt -- any ) are provided. The former does not depend on state, the latter does depend on state.

In the variant postpone/auto.via-mmode.fth the macro-compilation mode is employed (one more state, if you like). This is the variant that is loaded by default in the current version (Commit f3b7d01). The macro-compilation mode is very useful because it also allows implementing a more useful and advanced variant than your construct ]] ... [[.

Could you please demonstrate a problem concerning state-dependency in any of these approaches?

As for opaque vs. transparent: Opaque would only be an option if the only use of translators was really in the text interpreter and in postpone. But if we want to support other use cases (and there are other use cases, as discussed above), we should do a transparent user interface.

Could you please provide a practical example when you need a transparent token descriptor structure?

BerndPaysan [r1388] 2024-12-02 02:30:07

Using EXECUTE instead of a special translator-specific word makes it possible to use the rest of the recognizer API for interpreters that don't have any state at all. This actually happens and is useful; e.g. the parser in net2o's chat system uses that. There's absolutely no need for any other mode than directly interpreting. And using EXECUTE does not mean you have to set STATE if you call a translator for a particular state (interpreting/compiling/postponing) directly. Though there are likely confusing results if you do so and the word executed is a state-smart word. The level of surprise is likely small, because so far, the only direct access method actually useful is the one for the postpone action. And that never executes the word found.

I don't want to mandate a particular implementation. Choose the implementation you like. I'll add an API that allows direct and default invocation of a translator. I'm not sure if I want this in the same proposal or split it into another one, so we can vote on those separately.

ruv [r1390] 2024-12-04 01:37:18

@AntonErtl wrote in r1038

when you write a state-independent text interpreter, such as a polyForth-style text interpreter, or colorforth-bw, you would have to set state before executing the translators, which is perverse.

It is not more perverse than repeating «_», «]», or «[» before each lexeme in a program ;)

In general, when the Recognizer word set is provided, the Forth text interpreter itself knows nothing about STATE. If you write a state-independent text interpreter, your recognizers should provide state-independent token translators. And you do not have to set state at all. I rewrote your colorforth-bw example in Recognizer API. It just works. Note how translators are embedded into recognizers using quotations (in Commit 6c72064).

And in the case of colorforth-bw, again there is no standard way to set state to get the translator to perform xt-post.

In this approach, why do you need to write «[postpone _foo» instead of «]foo» ?

AntonErtl [r1397] 2024-12-07 07:22:49

@BerndPaysan:

If you eliminate the state-dependence of translators, then text interpreters that use more than just the xt-int action (e.g., the one for colorforth-bw, see below) can be written without having to deal with state. And text interpreters that use xt-post can be written using the proposed wordset rather than having to use a detour through postpone (which is a parsing word, possibly introducing additional complications).

The following is also relevant to @ruv:

Ruv's colorforth-bw implementation demonstrates the shortcomings of the present proposal, because it does not use recognizers nor translators at all for implementing recognize-colorforth-bw; instead, it reimplements everything that the name recognizer and the number recognizer already do internally, nicely demonstrating that the present proposal buries the tools. And it only implements dealing with names and single-cell numbers. Finally, the implementation is so long (44 lines without putting it into forth-recognize) that you have not shown it inline, but posted a link to github.

By contrast, let's take much of the proposal from [r1081], but replace the state-dependent translators with the state-independent rectypes of [160]. With such a proposal, colorforth-bw might look as follows (untested):

defer recognizer1 forth-recognizer is recognizer1

: prefix>index ( c -- n )
  case
    '[' of  0 endof
    '_' of -1 endof
    ']' of -2 endof
    1 swap
  endcase ;
  
: rectype-colorforth-bw ( ... rectype index state -- ... )
  drop \ we use index, not the surrounding Forth interpreter's state
  swap execute ;

: recognize-colorforth-bw ( c-addr u -- )
  dup 0= if 2drop ['] notfound exit then
  over c@ prefix>index dup 0 > if 2drop drop ['] notfound exit then
  >r 1 /string recognizer1 r> ['] rectype-colorforth-bw ;

' recognize-colorforth-bw set-forth-recognize

This has only 20 lines (vs. 44), and it uses all the recognizers originally present in forth-recognizer (name, integers (including doubles), FP, etc.). This demonstrates the superior expressive power of the rectypes from [160] over the translators from [r1081].

BTW, I find the presence of both forth-recognize and forth-recognizer confusing, and would prefer to define forth-recognize as a deferred word. If you have to have getters and setters, call the getter get-forth-recognize.

In this approach, why do you need to write «[postpone _foo» instead of «]foo» ?

Nobody is suggesting that. But you need to perform xt-post in order to implement ]foo. In your implementation, you do it by reimplementing xt-post for the two recognizers you implement internally to recognize-colorforth-bw. If you used a detour through postpone instead, you would use the xt-post invoked in that way. And in my implementation above, xt-post is invoked directly.

ruv [r1405] 2024-12-08 22:59:20

@AntonErtl writes:

Ruv's colorforth-bw implementation demonstrates the shortcomings of the present proposal, because it does not use recognizers nor translators at all for implementing recognize-colorforth-bw; instead, it reimplements everything that the name recognizer and the number recognizer already do internally,

It's wrong. Have a look at L18-L19:

  \ Reuse a recognizer for numbers
  ['] recognize-number-n-prefixed apply-recognizer-cf dup 0= if exit then

It uses the recognizer for numbers. And it uses find-name instead of the recognizer for names (Forth words) just because it's simpler in this case. It does not reuse token translators.

And it only implements dealing with names and single-cell numbers.

Because your original example implemented only that. And I just rewrote your original example.

Finally, the implementation is so long (44 lines without putting it into forth-recognize) that you have not shown it inline, but posted a link to github.

Why count 10 lines of comments at the beginning of the file? Without comments, 31 lines, the same as in your example (lexical size is greater due to nt vs xt, and improvements in the behavior).

By contrast, let's take much of the proposal from [r1081], but replace the state-dependent translators with the state-independent rectypes of [160]. With such a proposal, colorforth-bw might look as follows (untested):

[...]

This has only 20 lines (vs. 44), and it uses all the recognizers originally present in forth-recognizer (name, integers (including doubles), FP, etc.). This demonstrates the superior expressive power of the rectypes from [160] over the translators from [r1081].

(I corrected the [r1081] link in the citation above)

This comparison is incorrect. Below is an implementation against the latest API version (except compile-postpone-qtoken that is a variation of discussed postpone-qtoken, which should be either present or implementable in any variant of API):

: cf-prefix>tt? ( c -- tt true | c false )
  case
    '[' of ['] execute-interpreting endof
    '_' of ['] execute-compiling endof
    ']' of ['] compile-postpone-qtoken endof
    0 exit
  endcase true
;

defer recognize-default  perceptor is recognize-default

: recognize-colorforth-bw ( sd.lexeme -- qt|0 )
  dup 0= if nip exit then
  over c@ cf-prefix>tt? 0= if drop 2drop 0 exit then
  >r 1 /string recognize-default dup if r> exit then rdrop
;

16 lines.

Can be tested in Gforth too:

gforth index.fth example/recognize-colorforth-bw.fth

:noname cf( _1. _drop _s" foo" ) ; execute s" foo" compare 0=  .s \ prints "1 -1"

AntonErtl [r1407] 2024-12-09 07:25:04

The latest proposal is [r1081] and it does not contain execute-interpreting, execute-compiling, compile-postpone-qtoken, or perceptor. And that's what we were tasked with discussing and giving feedback on. And that's what I did.

ruv [r1408] 2024-12-09 09:38:01

The latest proposal is [r1081] and it does not contain execute-interpreting, execute-compiling, compile-postpone-qtoken, or perceptor. And that's what we were tasked with discussing and giving feedback on. And that's what I did.

I see, thank you. Actually, [r1081] is outdated, a new version will be prepared soon and then it should be discussed (was noted in the recognizer chat). Nevertheless, my example implementation for recognize-colorforth-bw above is compatible with [r1081] with the following exceptions: it relies on 0 instead of NOTFOUND (you should note how it makes things simpler), and it uses the method compile-postpone-qtoken that appends the compilation semantics of a qualified token to the current definition (this method is missing in [r1081]). The word perceptor is simply a better name than forth-recognizer in [r1081] (I just posted in ForthHub/fep-recognizer a rationale from the chat).

The words execute-interpreting and execute-compiling are general words that are needed anyway to perform interpretation or compilation semantics regardless of the initial STATE; they can be implemented in standard Forth as:

: compilation ( comp: true ; S: -- true ; | comp: false ; S: -- false ; )  state @ 0<> ;
: enter-compilation ( comp: false -- true ; S: -- ; | comp: true  ; S: -- ; )  ] ;
: leave-compilation ( comp: true -- false ; S: -- ; | comp: false ; S: -- ; )  postpone [ ;
: execute-interpreting ( i*x xt -- j*x )
  compilation 0= if execute exit then
  leave-compilation execute enter-compilation
;
: execute-compiling ( i*x xt -- j*x )
  compilation if execute exit then
  enter-compilation execute leave-compilation
;

ruv [r1409] 2024-12-09 10:00:24

@AntonErtl writes:

If you eliminate the state-dependence of translators, then text interpreters that use more than just the xt-int action (e.g., the one for colorforth-bw, see below) can be written without having to deal with state.

Token translators cannot be written without having to deal with state (possibly indirectly), by the term definition. A token translator shall perform different actions depending on the state, and it does not matter how the state is passed to the translator: through the data stack, through a separate stack intended for this purpose, or through an internal variable. The state does not matter in only one case: if the translator shall perform the same action regardless of the state.

Moreover, if you pass a parameter that encodes compilation state or interpretation state not through STATE, you have to keep STATE in sync with this parameter to guarantee that STATE-dependent words are translated correctly.

BerndPaysan New Version: minimalistic core API for recognizers [r1412] 2024-12-15 22:00:00

Minimalistic Recognizer API

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version
2020-09-08 taking ruv's approach and vocabulary at translators
2020-09-08 replace the remaining rectypes with translators
2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
2022-09-08 adjust reference implementation to results of last bikeshedding discussion
2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
2022-09-10 More complete reference implementation
2022-09-10 Add use of extended words in reference implementation
2022-09-10 Typo fixed
2022-09-12 Fix for search order reference implementation
2022-09-15 Revert to Trute's table approach to call specific modes deliberately
2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
2023-09-15 Add list of example recognizers and their names.
2024-12-15 Take comments after freezing the proposal into account

Problem

The Forth compiler can be extended easily. The Forth interpreter however has a fixed set of capabilities as outlined in section 3.4 of the standard text: Words from the dictionary and some number formats.

It's not possible to use the Forth text interpreter in an application or system extension context. Most interpreters in existing systems use a number of hooks to extend the interpreter. That makes it possible to use a loadable library to implement new data types to be handled like the built-in ones. One example is floating point numbers. They have their own parsing and data handling words including a stack of their own.

Furthermore applications need to use system provided and system specific words or have to re-invent the wheel to get numbers with a sign or hex numbers with the $ prefix. The building blocks (FIND, COMPILE,, >NUMBER etc) are available but there is a gap between them and what the Forth interpreter already does.

The Forth interpreter is stateful, but the API should avoid the problems of the STATE variable. In particular, an implementation without STATE should be possible, and there is only one place where the stateful dispatch is necessary.

Solution

The monolithic design of the Forth interpreter is factored into three major blocks:

The interpreter. It extracts sub-strings (lexemes) from SOURCE, hands them over to the data parsing and processes the results.
The actual data parsing. It analyses lexemes whether they match the criteria for a certain token type. These words, called recognizers, can be grouped to achieve an order of invocation.
The result of the recognizer, a translator and associated data, is handed over to the interpreter.

There is no strict 1:1 relation between a recognizer and the returned translator. A translator for e.g. single cell numbers can be used by different recognizers, a recognizer can return different translators (e.g. single and double cell numbers).

Whenever the Forth text interpreter is mentioned, the standard words EVALUATE (CORE), ' (tick, CORE), INCLUDE-FILE (FILE), INCLUDED (FILE), LOAD (BLOCK) and THRU (BLOCK) are expected to act likewise. This proposal does not aim to change these words, but to provide the tools to do so. As long as the standard feature set is used, a complete replacement with recognizers is possible.

Important changes to the Matthias Trute proposal:

Make the translators executable to dispatch according to the state (interpreting, compiling, postponing) themselves
Use dedicated invocation methods to call a translator for a particular state
Make the recognizer sequence executable with the same effect as a recognizer
Make sure the API is not mandating any particular implementation

The core principle is that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like

: recognize-xt ( addr u -- translator-stub | 0 )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] noop
  THEN ;

then you should factor the part starting with STATE @ out and return it as translator:

: translate-xt ( xt flag -- )
  0< state @ and  IF  compile,  ELSE  execute  THEN ;
: recognize-xt ( addr u -- ... translator | 0 )
  here place  here find dup IF  [']  translate-xt  THEN ;

In a second step, you need to remove the STATE @ entirely and use TRANSLATE:. If you don't know what to do on postpone in this stage, use -48 throw, otherwise define a postpone action:

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF  compile,  ELSE  execute  THEN ;
:noname ( xt flag -- ) 0< IF  postpone literal postpone compile,  ELSE  compile,  THEN ;
translate: translate-xt

Typical use

The standard interpreter loop should look like this:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize ?found execute  REPEAT
  2drop ;

with the usual additions to check e.g. for empty stacks and such.

To operate a recognizer in a particular state, e.g. to postpone a single word, do

: postpone ( "name" -- )
  parse-name forth-recognize ?found postponing ; immediate

To obtain an xt for a name, use something like this:

: ' ( "name" -- xt )
  parse-name forth-recognize ?found
  ['] translate-nt <> #-32 and throw
  name>interpret ;

Proposal:

XY. The optional Recognizer Wordset

XY.1 Introduction

Recognizers have the form

REC-SOMETYPE ( addr len -- i*x j*r translate-xt | 0/NOTFOUND )

A recognizer takes the string addr len of a lexeme and on success returns a translator translate-xt and additional data on the data and floating point stack.

[IF] NOTFOUND=0

If it fails, it returns 0.

[ELSE] NOTFOUND=xt

If it fails, it returns the xt of NOTFOUND.

For clarity, unless this issue is decided, the non-success return value of a recognizer is notated as 0/NOTFOUND. The reference implementation uses the option 0.

[THEN] notfound

[IF] side-effect

A recognizer shall not have a side effect.

Rationale: Side effects are supposed to all happen inside the translators. This promise makes it possible to try to recognize something and fail if the result is not desired, without having to roll back unknown changes. Examples: The tick and TO recognizers pass a substring of the string to be translated to FORTH-RECOGNIZE, and fail if the result is not a name type.

[THEN] side-effect

XY.3 Additional usage requirements

XY.3.1 Translator

translator: named subtype of xt, and executes with the following stack effect:

name ( j*x i*x -- k*x )

A translator is an xt that interprets, compiles or postpones the action of the recognized thing, according to the state the system is in.

i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the recognized lexeme.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZE ( addr len -- i*x translator-xt | 0/NOTFOUND ) RECOGNIZER

Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or 0/NOTFOUND if not.

[IF] defer

FORTH-RECOGNIZE is a deferred word. Changing the system recognizer can be done with IS FORTH-RECOGNIZE, obtaining the system recognizer with ACTION-OF FORTH-RECOGNIZE.

Rationale: use the existing API to change it; most simple systems have this available, and advanced systems have capabilities to work around limitations.

[ELSE] setter and getter

SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT

Assign the recognizer xt to FORTH-RECOGNIZE.

FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT

Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.

Rationale: not sufficiently advanced systems can work around the limitations of IS and ACTION-OF better with this API.

[THEN]

TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER

Create a translator word under the name "name". This word is the only standard way to define a general purpose translator.

"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current state.

Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. You can not simply set STATE, use EXECUTE and afterwards restore STATE to perform interpretation or compilation semantics, because words can change STATE, so you need the words INTERPRETING and COMPILING defined below. This problem does not apply to POSTPONING, so systems that only want to implement direct access to POSTPONE mode can get away without TRANSLATE:.
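As a hedged illustration of how TRANSLATE: might be used, the following sketch defines a translator and a matching recognizer for character lexemes of the form 'c'. The names translate-char, char-lit, and rec-char are hypothetical and not part of the proposal. translate: is copied from the reference implementation, so the sketch assumes that STATE holds 0, -1 or -2 and that a failing recognizer returns 0.

```forth
\ Copied from the reference implementation; assumes STATE is
\ 0 (interpreting), -1 (compiling) or -2 (postponing).
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

\ Hypothetical helper: compile c as a literal into the current definition.
: char-lit, ( c -- )  postpone literal ;

:noname ( c -- c ) ;                            \ interpret: keep c on the stack
' char-lit,                                     \ compile: compile c as a literal
:noname ( c -- ) char-lit, postpone char-lit, ; \ postpone: compile the compile action
translate: translate-char

\ Hypothetical recognizer: accepts exactly three characters, quote c quote.
: rec-char ( addr u -- c translate-char | 0 )
  3 <> IF  drop 0 EXIT  THEN
  dup c@ [char] ' <> IF  drop 0 EXIT  THEN
  dup 2 chars + c@ [char] ' <> IF  drop 0 EXIT  THEN
  char+ c@ ['] translate-char ;
```

In interpretation state, executing the translator returned by rec-char leaves the character code on the stack; the recognizer itself has no side effects, as the proposal requires.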

[IF] NOTFOUND=0

?FOUND ( translator-xt -- translator-xt | 0 -- never ) RECOGNIZER

Check if the recognizer was successful, and if not, perform a -13 THROW or display an appropriate error message if the exception wordset is not present.

[THEN] NOTFOUND=0

XY.6.2 Recognizer Extension Words

[IF] NOTFOUND=0

?NOTFOUND ( translator-xt -- translator-xt | 0 -- addr u notfound-xt )

Check if the recognizer was successful. If not, replace the 0 result with the addr u of the last scanned lexeme, and put the xt of the NOTFOUND translator on top of the stack.

NOTFOUND ( -- never ) RECOGNIZER

Translator for unsuccessful recognizers: perform a -13 THROW.

[THEN] NOTFOUND=0

POSTPONE ( "<spaces>lexeme" -- ) RECOGNIZER

Compilation: recognize lexeme. On success, perform the postpone action of the returned translator, otherwise -13 THROW or display the appropriate error message if the exception wordset is not present.

RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT

Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn on stack and proceeding towards xt1 until successful.
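The lookup order described for RECOGNIZER-SEQUENCE: can be sketched for the special case of two recognizers without any stack library. This is only an illustration under the NOTFOUND=0 option; try-two, rec-len1, rec-len2, t1 and t2 are hypothetical names, and the two "translators" are plain no-op words standing in for real ones.

```forth
\ Hypothetical stand-ins for translators.
: t1 ;
: t2 ;

\ Deliberately trivial recognizers: accept lexemes of length 1 or 2.
: rec-len1 ( addr u -- t1 | 0 )  nip 1 = IF  ['] t1  ELSE  0  THEN ;
: rec-len2 ( addr u -- t2 | 0 )  nip 2 = IF  ['] t2  ELSE  0  THEN ;

\ Try the topmost recognizer (xt-rec2) first, then fall back to xt-rec1,
\ mirroring "starting with xtn and proceeding towards xt1".
: try-two ( addr u xt-rec1 xt-rec2 -- i*x translator | 0 )
  2over 2>r  swap >r                 \ save the lexeme and xt-rec1
  execute dup IF  r> drop  2r> 2drop  EXIT  THEN
  drop  r> 2r> rot execute ;         \ retry the same lexeme with xt-rec1
```

The lexeme is saved before the first attempt because a failing recognizer consumes it and returns only 0.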

SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT

Set the recognizer sequence of xt-seq to xt1 .. xtn.

GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT

Obtain the recognizer sequence from xt-seq as xt1 .. xtn n.

TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT

Translates a name token:

Interpretation: perform the interpretation semantics of the word

Compilation: perform the compilation semantics of the word

Postpone: append the compilation semantics above to the current definition

REC-NT ( addr u -- nt translate-nt | 0/NOTFOUND ) RECOGNIZER EXT

Search the dictionary for the string addr u. If successful, return the nt and the xt of TRANSLATE-NT. If the search fails, return 0/NOTFOUND.

TRANSLATE-NUM ( x -- x | ) RECOGNIZER EXT

Translates a number:

Interpretation: keep the number on the stack

Compilation: Append the run-time defined in LITERAL to the current definition

Postpone: Append the compilation semantics above to the current definition

TRANSLATE-DNUM ( x1 x2 -- x1 x2 | ) RECOGNIZER EXT

Translates a double number:

Interpretation: keep the numbers on the stack

Compilation: Append the run-time defined in 2LITERAL to the current definition

Postpone: Append the compilation semantics above to the current definition

REC-NUM ( addr u -- x translate-num | xd translate-dnum | 0/NOTFOUND ) RECOGNIZER EXT

Convert addr u to a number x and the xt of TRANSLATE-NUM as specified in 3.4.1.3 or a double number xd and the xt of TRANSLATE-DNUM as specified in 8.3.1 if the double number wordset is available. If the conversion fails, return 0/NOTFOUND.

TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT

Translates a floating point number:

Interpretation: Keep r on the stack

Compilation: Append the run-time defined in FLITERAL to the current definition

Postpone: Append the compilation semantics above to the current definition

REC-FLOAT ( addr u -- r translate-float | 0/NOTFOUND ) RECOGNIZER EXT

Convert addr u to a number r as specified in 12.3.7 if the float wordset is available; if the conversion fails, return 0/NOTFOUND.

SCAN-TRANSLATE-STRING ( addr1 u1 string-rest<"> -- addr2 u2 | ) RECOGNIZER EXT

Complete parsing a string: addr1 u1 consists of the starting quote and additional characters up to the first space in the string. addr2 u2 consists of the entire string without the starting quote up to (but not including) the final quote, with the escape sequences translated according to the rules of S\". >IN is modified appropriately, and points just after the final quote. If there's no final quote in the current line, REFILL can be used to read in more lines, adding corresponding newlines into the string. The final quote can be inside addr1 u1, setting >IN backwards in that case.

Translate the string:

Interpretation: keep the string on the stack

Compilation: Append the run-time defined in SLITERAL to the current definition

Postpone: Append the compilation semantics stated above to the current definition

TRANSLATE-STRING ( addr1 u1 -- addr1 u1 | ) RECOGNIZER EXT

Translate the string:

Interpretation: keep the string on the stack

Compilation: Append the run-time defined in SLITERAL to the current definition

Postpone: Append the compilation semantics stated above to the current definition

?SCAN-STRING ( addr1 u1 scan-translate-string string-rest<"> -- addr2 u2 translate-string | ... translator -- ... translator ) RECOGNIZER

If the recognized token is an incomplete string, complete the scanning as defined for SCAN-TRANSLATE-STRING and replace the translator with the xt of TRANSLATE-STRING.

REC-STRING ( addr u -- addr u scan-translate-string | 0/NOTFOUND ) RECOGNIZER EXT

Check if addr u starts with a quote, and return that string and the xt of SCAN-TRANSLATE-STRING if it does, 0/NOTFOUND otherwise.

[IF] Optional API for direct access of translator states

INTERPRETING ( j*x xt -- k*x ) RECOGNIZER EXT

Execute xt-int of the translator xt. If xt is not a translator, do -21 THROW, or a best-effort attempt to execute xt in interpreting state.

COMPILING ( j*x xt -- ) RECOGNIZER EXT

Execute xt-comp of the translator xt. If xt is not a translator, do -21 THROW, or a best-effort attempt to execute xt in compiling state.

POSTPONING ( j*x xt -- ) RECOGNIZER EXT

Execute xt-post of the translator xt. If xt is not a translator, do -21 THROW, or a best-effort attempt to execute xt in postponing state.

GET-STATE ( -- xt ) RECOGNIZER EXT

Obtain the operation xt performed when translating.

SET-STATE ( xt -- ) RECOGNIZER EXT

Makes xt the operation performed when translating. If xt is not related to ' INTERPRETING, ' COMPILING, or ' POSTPONING, do -12 THROW.

[THEN] optional API for direct access of translator states

]] ( -- ) RECOGNIZER EXT

Interpretation semantics: undefined

Compilation semantics: Set the system into postpone state. The interpreter will then perform xt-post of all translators found. Compilation state resumes when [[ is recognized. This word may change STATE and the recognizer sequence to reflect the change of this state.

[[ ( -- ) RECOGNIZER EXT

Interpretation semantics: undefined

Compilation semantics: undefined

Postpone semantics: enter compilation state, see ]; all changes to STATE and recognizer sequence done by ]] are reverted.

Note: [[ needs special treatment in postpone mode, so it might also use a non-standard translator and be not a word at all.

STATE ( -- addr ) RECOGNIZER

If ]] uses STATE to store postpone state, extends the semantics of 6.1.2250 by adding a second non-zero value. ]] enters this state, and [[ leaves it. Only translators and the code responsible for displaying the prompt can see this third state, as all other words are postponed in this state.

Reference implementation:

Defer forth-recognize ( addr u -- i*x translator-xt / 0 )
: ?found ( translator -- translator  |  0 -- never )
  dup 0= IF  -13 throw  THEN ;
: interpret ( i*x -- j*x )
  BEGIN
      parse-name dup  WHILE
      forth-recognize ?found execute
  REPEAT  2drop ; \ drop the empty lexeme left by the final PARSE-NAME
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

An alternative implementation for TRANSLATE: can use a deferred word:

Defer do-translate
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , , does> do-translate ;
: set-state ( xt -- ) dup is do-translate  >body @ 2 - state ! ;
: get-state ( -- xt ) action-of do-translate ;

Extensions reference implementation:

: lit,  ( n -- )  postpone literal ; \ defined first, the translators below use it
: ]] -2 state ! ; immediate
: [[ -1 state ! ; immediate
:noname name>interpret execute ;
:noname name>compile execute ;
:noname dup name>interpret ['] [[ =
  IF    name>interpret execute \ special case
  ELSE  name>compile swap lit, compile,  THEN ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )

: rec-nt ( addr u -- nt nt-translator | 0 )
  forth-wordlist find-name-in dup IF  ['] translate-nt  THEN ;
: rec-num ( addr u -- n num-translator | 0 )
  0. 2swap >number 0= IF  2drop ['] translate-num  ELSE  2drop drop 0  THEN ;

: minimal-recognize ( addr u -- nt nt-translator | n num-translator | 0 )
  2>r 2r@ rec-nt dup 0= IF  drop 2r@ rec-num  THEN  2rdrop ;

' minimal-recognize is forth-recognize

: translate-method: ( n -- )
  Create , DOES> @ cells + >body @ execute ;
0 translate-method: postponing
1 translate-method: compiling
2 translate-method: interpreting

: set-state ( xt -- )
  >body @ 2 - state ! ;
: get-state ( -- xt )
  case state @
      0  of ['] interpreting  endof
      -1 of ['] compiling     endof
      -2 of ['] postponing    endof
  -11 throw
  endcase ;

: postpone ( "name" -- )
  parse-name forth-recognize ?found postponing ; immediate

This reference implementation uses a table dispatch only. Note that this can give surprising results when you directly apply a particular state, and one of the words executed (translator or nt/xt found) is a state-smart word. If you want to use combined translators, like

: translate-dnum ( d -- ) >r translate-num r> translate-num ;

you can't do it like this. Neither does this work if you execute state-smart words, as they expect STATE to be set accordingly. Instead, you'll use something like

: translate-method: ( n -- )
  Create ,
  DOES> @ dup state @ = IF  drop execute  EXIT  THEN
    state @ >r  state !  execute  r> state ! ;

This will definitely work for combined literal translators, because those don't change state anyways.

This will also work for POSTPONE, because apart from the translator, no word is actually executed in one-shot POSTPONE, and therefore, no state change is possible.

This will also work for [ and ] (and words using them) while interpreting and compiling, because if you are already in the state from which the state is changed away, you will not restore the state. If you are in the state this will change to, this will work, too, because the state is restored after EXECUTE. This will not work if you are interpreting, and you do a s" ]]" forth-recognize ?found compiling, because that transitions to postponing, and then is reverted to interpreting.

[IF] setter and getter

: set-forth-recognize ( xt -- )
  is forth-recognize ;
: forth-recognizer ( -- xt )
  action-of forth-recognize ;

[THEN] setter and getter

Stack library

: STACK: ( size "name" -- )
  CREATE 0 , CELLS ALLOT ;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
  DUP @ >R R@ CELLS + R@ BEGIN
    ?DUP
  WHILE
    1- OVER @ ROT CELL - ROT
  REPEAT
  DROP R> ;

Recognizer sequences

: recognize ( addr len rec-seq-id -- i*x translator-xt | 0 )
  DUP >R @
  BEGIN
    DUP
  WHILE
    DUP CELLS R@ + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP IF
      2R> 2DROP 2R> 2DROP EXIT
    THEN
    DROP R> 2R> ROT
  REPEAT
  DROP 2DROP R> DROP 0
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
  min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
  DOES>  recognize ;
: ?defer@ ( xt1 -- xt2 )
  BEGIN dup is-defer? WHILE  defer@  REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- )
  ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n )
  ?defer@ >body get-stack ;

Once you have recognizer sequences, define

' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize

: find-name-in ( addr u wid -- nt / 0 )
  execute dup IF  drop  THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
  ['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
  ['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
  ['] search-order set-recognizer-sequence ;

Recognizer examples

Apart from the standardized recognizers above, here are some more examples of recognizers:

REC-TICK ( addr u -- xt translate-num | 0/NOTFOUND ) If addr u starts with a ` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.

REC-SCOPE ( addr u -- nt translate-nt | 0/NOTFOUND ) Search for words in specified vocabularies (the vocabulary needs to be found in the current search order). The string addr u has the form vocabulary:name. Apart from specifying the vocabulary to be searched, REC-SCOPE is identical in effect to REC-NT.

REC-TO ( addr u -- xt n translate-to | 0/NOTFOUND ) Handle the following syntax of TO-like operations of value-like words:

->name as TO name
=>name as IS name
+>name as +TO name
'>name as ADDR name
@>name as ACTION-OF name

xt is the execution token of the value found, n indexes which variant of a TO-like operation is meant, and translate-to is the corresponding translator.

REC-ENV ( addr u -- addr1 u1 translate-env | 0/NOTFOUND ) Takes a pattern in the form of ${name} and provides the name as addr1 u1 on the stack. The corresponding translator TRANSLATE-ENV is responsible for looking up that name in the operating system's environment variable array, or compiling appropriate code to do so.

REC-COMPLEX ( addr u -- rr ri translate-complex | 0/NOTFOUND ) Converts a pair of floating point numbers in the form of float1+float2i into a complex number on the stack, and returns the xt of TRANSLATE-COMPLEX on success.

Testing

T{ 0 recognizer-sequence: RS -> }T

T{ :noname 1 ;  :noname 2 ;  :noname 3  ; translate: translate-1 -> }T
T{ :noname 10 ; :noname 20 ; :noname 30 ; translate: translate-2 -> }T

\ really stupid: 1 character length or 2 characters
T{ : rec-1 NIP 1 = IF ['] translate-1 ELSE 0 THEN ; -> }T
T{ : rec-2 NIP 2 = IF ['] translate-2 ELSE 0 THEN ; -> }T

T{ ' translate-1 interpreting  -> 1 }T
T{ ' translate-1 compiling     -> 2 }T
T{ ' translate-1 postponing    -> 3 }T

\ set and get methods
T{ 0 ' RS set-recognizer-sequence -> }T
T{ ' RS get-recognizer-sequence -> 0 }T

T{ ' rec-1 1 ' RS set-recognizer-sequence -> }T
T{ ' RS get-recognizer-sequence -> ' rec-1 1 }T

T{ ' rec-1 ' rec-2 2 ' RS set-recognizer-sequence -> }T
T{ ' RS get-recognizer-sequence -> ' rec-1 ' rec-2 2 }T

\ testing RECOGNIZE
T{         0 ' RS set-recognizer-sequence -> }T
T{ S" 1"     RS   -> 0 }T
T{ ' rec-1 1 ' RS set-recognizer-sequence -> }T
T{ S" 1"     RS   -> ' translate-1 }T
T{ S" 10"    RS   -> 0 }T
T{ ' rec-2 ' rec-1 2 ' RS set-recognizer-sequence -> }T
T{ S" 10"    RS   -> ' translate-2 }T

ruv [r1422] 2025-02-13 13:14:50

Re `interpreting`

(similar arguments apply to the word compiling too)

From the proposal's "Problem" section:

The Forth interpreter is stateful, but the API should avoid the problems of the STATE variable. In particular, an implementation without STATE should be possible, and there is only one place where the stateful dispatch is necessary.

We should consider that Forth words may do stateful dispatch by themselves and they may rely on the value of STATE. Usually, the Forth system itself cannot determine whether a user-defined word performs stateful dispatch. Therefore, it is essential for the Forth system to ensure the STATE variable is correctly set to reflect the formal state of the Forth text interpreter when executing a user-defined word.

The assumption that the value of STATE is irrelevant when xt-int is executed—because xt-int does not perform stateful dispatch itself—is flawed. This is because, when xt-int is executed, it may invoke a user-defined word that performs stateful dispatch.

The suggested word INTERPRETING ( j*x xt -- k*x ) is confusing and useless, because it just executes xt-int (obtained from xt) and does not ensure that the value of STATE is 0 before a user-defined word is invoked by xt-int.

I suggested the word execute-interpreting that applies to any xt. When it applies to a token translator, the corresponding interpretation semantics are performed. And this correctly works even when a user-defined word that performs stateful dispatch is invoked by the token translator.
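To make the contrast with INTERPRETING concrete, here is a minimal sketch of what such a word could look like in standard Forth. It is an assumption made for illustration, not the definition from ruv's implementation; show-state and demo are hypothetical names. The sketch enters interpretation state only around the call and unconditionally returns to compilation state afterwards:

```forth
\ Sketch only: run xt with STATE = 0, whatever the current state is.
: execute-interpreting ( i*x xt -- j*x )
  state @ if
    postpone [      \ enter interpretation state at run time
    catch           \ run xt; keep control even if it throws
    ]               \ return to compilation state
    throw           \ re-raise a pending exception, if any
  else
    execute
  then ;

\ A STATE-smart word standing in for arbitrary user code.
: show-state ( -- )
  state @ if ." compiling " else ." interpreting " then ;

: demo ( -- ) ['] show-state execute-interpreting ; immediate

demo             \ prints "interpreting "
: dummy demo ;   \ also prints "interpreting ", although STATE was non-zero
```

Because the xt is executed only after STATE has been cleared, a STATE-smart word called anywhere below it sees interpretation state, which is the property the plain INTERPRETING lacks.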

From the rationale to TRANSLATE::

The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. You can not simply set STATE, use EXECUTE and afterwards restore STATE to perform interpretation or compilation semantics, because words can change STATE, so you need the words INTERPRETING and COMPILING defined below.

The provided specification does not guarantee that interpreting and compiling solve this problem, and in the reference implementation they do not solve the problem.

The words execute-interpreting and execute-compiling solve the problem, and they do not need to know xt-int or xt-comp from a token translator.

Re "postponing" state

What is the rationale for formally introducing a "postponing" state? If you need it only for ]] ... [[, then it's better to extract them into a separate proposal, and also explain why this approach is better than implementing ]] as a parsing word.
Why do you need to specify that ]] changes STATE to a third value if user-defined words do not see this value and cannot analyze it?

Re `set-state` and `get-state`

It's unclear how these words can be used.

It seems that the word SET-STATE is underspecified. Also, its name is confusing, because it formally is not allowed to change STATE (at the moment).

Re translators and `translate:`

translator: named subtype of xt, and executes with the following stack effect: name ( j*x i*x -- k*x )

Why do you require a translator to be named? I use anonymous token translators (defined as quotations) and find them very useful.

From the spec to TRANSLATE::

Create a translator word under the name "name". This word is the only standard way to define a general purpose translator.

It's necessary to define what a "general purpose translator" is, and it should be clear how a translator that is not a general purpose translator can be defined in a standard way.

Minimize core

To reduce the scope of discussion, we should minimize the core API.

So, it is better to put the words ]], [[, RECOGNIZER-SEQUENCE, etc, to separate proposals.

But a recognizer for local variables should be added, because they are already standardized and a Forth system that supports local variables must recognize them.

GeraldWodni [r1423] 2025-02-13 14:17:51

@ruv: I agree, that it would be nicer to have sequences etc. not in the proposal. However: They are a good example of how recognizers can be used in practice, which makes me think they should stay inside to give some guidance to rec:newbies. We can still ask Bernd for further modifications, but I think we should do so grouped together after the meeting, to avoid unnecessary edits.

BerndPaysan [r1424] 2025-02-14 11:21:35

I don't see a problem to separate the proposal into several smaller ones, especially taking optional parts out that belong together.

The postpone mode can indeed be implemented either loop-style (i.e. like PolyForth's ]), or with a state; it shouldn't be necessary to specify the details.

If you have STATE-smart words in your system or user-defined such words, the only way to get the correct interpretation and compilation semantics involves having STATE as expected, you can't just call INTERPRETING or COMPILING on a translator or use some table index mechanism as in the Trute proposal to call the right slot.

If you don't have such things and have a Forth system where STATE-free replacement mechanisms are used for dual-semantics words (e.g. Gforth or VFX), and you don't define STATE-smart words yourself, you can actually use that API. That's why I think such an API can be actually standardized before we make STATE obsolescent and have standardized replacements available.

BerndPaysan [r1425] 2025-02-14 15:18:17

can actually be standardized

I mean can't. We need to phase out STATE and define possible replacements before we can have a STATEless API.

AntonErtl [r1426] 2025-02-15 09:57:48

At the online meeting on 2025-02-13 I was asked to present a subproposal for factoring the state-dependent component out of TRANSLATE:.

There are many possible ways to skin this cat, e.g., the one in Matthias Trute's proposal, or the way that the present proposal used up to v4. Here I present a way that requires relatively few changes to the current version of this proposal.

XY.3.1 Definition of terms

Replace the definition of translator with:

translator: a cell-sized opaque token that represents how a recognized lexeme can be interpreted, compiled, or postponed. A translator usually needs additional data about the recognized lexeme that is deeper in the stacks.

Replace uses of translator-xt in ?NOTFOUND with translator, and likewise for other words that, in [r1412], consume or push the xt of a translator.

XY.6 Glossary

TRANSLATOR:

Replace the definition of TRANSLATE: with

TRANSLATOR: ( xt-int xt-comp xt-post "<spaces>name" -- )

Skip leading space delimiters. Parse name delimited by a space. Create a definition for name with the execution semantics defined below.

name is referred to as translator.

name Execution: ( -- translator )

translator represents a translator with interpretation action xt-int, compilation action xt-comp, and postpone action xt-post.

Modified words:

INTERPRETING ( i*x translator -- k*x )

Execute xt-int of translator.

COMPILING ( j*x translator -- l*x )

Execute xt-comp of translator.

POSTPONING ( j*x translator -- )

Execute xt-post of translator.

STATE-TRANSLATING

Add:

STATE-TRANSLATING ( i*x translator -- j*x )

Remove translator from the stack.

If the system has a postpone state, and is currently in postpone state, execute xt-post of translator.

Otherwise, if the system is in interpretation state, execute xt-int of translator.

Otherwise, execute xt-comp of translator.

Discussion

The benefit of having each translator word return a translator token is that one does not need to tick the translator words in all the recognizers. A slight improvement in writability and readability with no downside (compared to [r1412]).

The benefit of factoring out state-translating is that the state dependence can be confined to the place(s) that actually need state dependence: The standard Forth text interpreter (and user-defined text interpreters that are intended to work similarly). It does not infect all translators.

Typical use

The standard interpreter loop:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize ?found state-translating  REPEAT
  2drop ;

Implementation of POSTPONE is the same as in the existing proposal:

: postpone ( "name" -- )
  parse-name forth-recognize ?found postponing ; immediate

The implementation of ' becomes slightly shorter (no need to tick translate-nt):

: ' ( "name" -- xt )
  parse-name forth-recognize ?found
  translate-nt <> #-32 and throw
  name>interpret ;

Now for interpreter loops that do not use STATE.

First, the polyForth division of interpreter and compiler:

: parse-name-refill ( -- c-addr u )
  begin
    parse-name dup 0= while
      2drop refill 0= if
        0 0 exit then
  repeat ;

: ] ( i*x -- j*x )
  BEGIN
    parse-name-refill dup while
      2dup "[" str= 0= while
        forth-recognize ?found compiling
  REPEAT then \ THEN resolves the first WHILE
  2drop ;

: pf-interpret ( i*x -- j*x )
  BEGIN  parse-name-refill dup  WHILE  forth-recognize ?found interpreting  REPEAT
  2drop ;

And here's one for colorforth-bw:

: cfbw-interpret ( i*x -- j*x )
  begin
    parse-name dup  while
      over c@ >r 1 /string forth-recognize ?found r> case
        '[' of interpreting endof
        '_' of compiling endof
        ']' of postponing endof
        -13 throw
      endcase
  repeat
  2drop ;

The problem with these interpreters is that there is no standardized or proposed way to plug such an interpreter into the existing infrastructure (e.g., included), so the benefit of being able to write this is limited to one line (in case of colorforth-bw) or the rest of the file in case of the polyForth-style interpreter.

But the recognizer proposal allows replacing forth-recognizer, and this allows us to plug colorforth-bw into the text interpreter until further notice. I presented a way to do it with an earlier version of this proposal in [r1397]; here's a way of doing it with [r1412] modified by this sub-proposal:

defer recognizer1 action-of forth-recognize is recognizer1

: translator-bw1 ( i*x translator c -- j*x )
  case
    '[' of interpreting endof
    '_' of compiling endof
    ']' of postponing endof
    -13 throw
  endcase ;

' translator-bw1 dup dup translator: translator-bw

: recognize-colorforth-bw ( c-addr u -- translator )
  dup 0= if 2drop 0 exit then
  over c@ >r 1 /string recognizer1
  r> over if translator-bw else drop then ;

' recognize-colorforth-bw is forth-recognize

Reference implementation:

A straightforward implementation is:

: translator: ( xt-int xt-comp xt-post "<spaces>name" -- )
  create , , , ;

: state-translating ( i*x translator -- j*x )
  state @ if compiling else interpreting then ;

This does not cover a potential postpone state; if a system has a postpone state and can enter the standard text interpreter in this state, then the implementation of state-translating should be extended accordingly.
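The straightforward state-translating above calls INTERPRETING and COMPILING, which this sub-proposal specifies but does not implement. A minimal sketch of the three accessors, assuming the three-cell layout that create , , , produces (xt-post in the first cell, xt-comp in the second, xt-int in the third); demo-translator is a hypothetical name used only for illustration:

```forth
: translator: ( xt-int xt-comp xt-post "<spaces>name" -- )
  create , , , ;   \ stores xt-post first, then xt-comp, then xt-int

\ Accessors for the assumed layout: cell 0 = xt-post, 1 = xt-comp, 2 = xt-int.
: postponing   ( j*x translator -- k*x )             @ execute ;
: compiling    ( j*x translator -- k*x )  cell+      @ execute ;
: interpreting ( j*x translator -- k*x )  2 cells +  @ execute ;

: state-translating ( i*x translator -- j*x )
  state @ if compiling else interpreting then ;

\ Hypothetical translator whose actions just push a marker value.
:noname 1 ; :noname 2 ; :noname 3 ; translator: demo-translator

demo-translator interpreting .        \ prints 1
demo-translator compiling .           \ prints 2
demo-translator postponing .          \ prints 3
demo-translator state-translating .   \ prints 1 (interpretation state)
```

On a system that already provides words with these names, the definitions above shadow them, so this is best tried in a scratch session.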

Of course, this implementation of state-translating is far too inefficient for some tastes, so here's a more clever one:

: state-translating ( i*x translator -- j*x )
  2 state @ 0<> + cells + @ execute ;

For even more efficiency we can redefine ] and [:

defer state-translating

: [ ( -- )
  postpone [ ( old implementation ) ['] interpreting is state-translating ; immediate

[ \ initialize state-translating

: ] ( -- )
  ] ( old implementation ) ['] compiling is state-translating ;

If there is a word that sets the postpone state, that word should also set state-translating accordingly.

There are also changes involving words that push literal translator tokens. In [r1412] the translator word needs to be ticked; in this subproposal you do not do that. E.g., rec-nt now looks as follows:

: rec-nt ( addr u -- nt nt-translator | 0 )
  forth-wordlist find-name-in dup IF  translate-nt  THEN ;

AntonErtl [r1427] 2025-02-16 18:45:17

STATE-dependence

[r1412] still contains a defining word for state-dependent translators (and none for translators without this mistake), which are unacceptable to me. I have suggested an improvement in [r1426].

Dividing the proposal?

There have been some discussions about dividing the proposal. I don't think that that's a good idea for the discussion, but in usage I see the division into the following hierarchy of use cases, which require different words; the later use cases usually require also implementing the words for the earlier use cases:

Programs that use the default recognizers. For them we need to specify a standard recognizer sequence (including how to deal with locals): REC-NT REC-NUM REC-FLOAT (if present) corresponds to Forth-2012. I expect systems that have REC-STRING and REC-TICK to put these into their recognizer sequence, too. How do we document in the program documentation which recognizers are needed? Probably we need to extend the program documentation requirements (until now the recognition of doubles, floats and locals has been coupled with documenting the double, float and local wordset, respectively, but for REC-STRING and REC-TICK that's probably not the way to go).

The new POSTPONE is also at that usage level.
Programs that change which of the existing recognizers are used and in what order. For them we need the names of the existing recognizers (not sure about the translators), FORTH-RECOGNIZE, SET-RECOGNIZER-SEQUENCE, GET-RECOGNIZER-SEQUENCE, .RECOGNIZERS (not yet proposed) and maybe RECOGNIZER-SEQUENCE:. If all the standardized recognizers are in FORTH-RECOGNIZE by default, there will probably not be much of this kind of usage, except maybe to put REC-FLOAT in front of REC-NUM (to recognize "1." as float; REC-FLOAT would have to be defined in more detail for that to work).
Programs that define new recognizers that use existing translators. This usage needs the names of the translators.
Programs that define new translators. This usage needs TRANSLATE: (or TRANSLATOR:).
Programs that define text interpreters and programming tools that have to deal with recognizers (such as a recognizer-aware postpone). These programs need INTERPRETING, COMPILING, POSTPONING or STATE-TRANSLATING.

A system with recognizers is a program of all these types, so all these words will be present in every such system (with the exception of some recognizers and related translators), so there is little point in making most of these words optional (except rec-float, rec-string, rec-tick and translators used only by those recognizers). But it is still a good idea to present the words divided by these usages. We usually present words in alphabetical order in the document. Should we continue this tradition for these words? If so, the division of words above should probably be documented in the rationale.

For word counters

Given that usage 5 above is rare in user programs, word counters may prefer to replace the four words INTERPRETING, COMPILING, POSTPONING or STATE-TRANSLATING with one word

TRANSLATING ( i*x translator n -- j*x )

where

0 TRANSLATING is equivalent to INTERPRETING
-1 TRANSLATING is equivalent to COMPILING
-2 TRANSLATING is equivalent to POSTPONING
STATE @ 0<> TRANSLATING is equivalent to the reference implementation of STATE-TRANSLATING
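The equivalences above fall out of simple index arithmetic if one assumes the three-cell layout of the reference implementation in [r1426] (create , , , leaves xt-post in cell 0, xt-comp in cell 1 and xt-int in cell 2). A minimal sketch under that assumption, with demo-translator as a hypothetical name:

```forth
: translator: ( xt-int xt-comp xt-post "<spaces>name" -- )
  create , , , ;   \ cell 0 = xt-post, cell 1 = xt-comp, cell 2 = xt-int

\ n = 0 selects cell 2, n = -1 selects cell 1, n = -2 selects cell 0.
: translating ( i*x translator n -- j*x )
  2 + cells + @ execute ;

\ Hypothetical translator whose actions just push a marker value.
:noname 1 ; :noname 2 ; :noname 3 ; translator: demo-translator

demo-translator  0 translating .           \ prints 1 (like INTERPRETING)
demo-translator -1 translating .           \ prints 2 (like COMPILING)
demo-translator -2 translating .           \ prints 3 (like POSTPONING)
demo-translator state @ 0<> translating .  \ prints 1 in interpretation state
```

Other values of n index outside the translator and are not checked in this sketch.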

A simple Forth system has only one use of POSTPONING (in POSTPONE) and one use of STATE-TRANSLATING (in INTERPRET), so defining 4 words for the purpose may seem excessive. And replacing them with TRANSLATING saves a tiny bit of source code and memory.

OTOH, there is no standard way to use TRANSLATING for STATE-TRANSLATING in the general case, where the system has a postpone state, because there is no standard way to determine postpone state. Moreover, the specification of TRANSLATING is not so nice (that's why I left it out in the above), and the code using it will be less readable.

Gerund

It's not clear to me why the gerund form is used (INTERPRETING etc.), although I kept with it for my suggestions (for consistency). I would use an imperative form; and because "interpret", "compile" and "postpone" are already taken, maybe something like TRANSLATOR>INTERPRET or somesuch, which would parallel NAME>INTERPRET. However, the latter pushes an xt, the former executes it, so either we let TRANSLATOR>INTERPRET also produce an xt, or use a slightly different naming scheme, such as TRANSLATOR*INTERPRET.

GET-STATE SET-STATE

It's unclear what get-state and set-state do, and their names suggest a stack effect ( -- f ) and ( f -- ).

The reference implementation does not make that any clearer; in particular, the reference implementation of set-state does not make any sense at all, and I would not know why anybody would want to use get-state.

[IF] parts

This makes the proposal hard to understand and discuss. Take a decision (possibly after asking around, but I doubt that anyone but you and maybe ruv has a proper basis for an opinion), put it in the proposal, and give a rationale for the decision in a section Discussion.

Side effects

I do not see a good way to specify in the normative part of the document that a recognizer must not have a side effect. The proposal mentions "supposed to" and "promise". The normative part says what specific words do (or there is an ambiguous condition). It seems to me that the discussion about side effects should go into the non-normative rationale. It's clear enough what happens when somebody uses a word that invokes a recognizer, and that recognizer has a side effect; no need for an ambiguous condition.

NOTFOUND

I have no preference here, but I remember that Matthias Trute presented a case for notfound, and that sounded convincing. Why do his arguments no longer hold (or did they not hold in the first place)?

`FORTH-RECOGNIZE`, deferred or getter and setter?

I see no benefits to having a getter and setter here. Deferred words are fine.

Presentation

The "Solution" chapter is not comprehensible except to those deep into the discussion: It is full of unexplained terms, such as "data parsing", "token type". And "translator" is not comprehensible to anybody who comes fresh to the proposal, and even to those who have seen some earlier recognizer proposals.

The second part of "Solution" should be a separate section "Transition for some implementors/users of Matthias Trute's proposal".

More `NOTFOUND` stuff

The proposal defines ?FOUND, ?NOTFOUND, and NOTFOUND only for NOTFOUND=0. This looks like a bug to me.

The stack effect of ?FOUND and other words: We do not have "never" in the standard. What's that supposed to mean?

XY.3.1 Translator

"named subtype"? What's that? The rest of the wording is woefully inadequate. A careful specification would reveal the complexity that you get with state-dependent translators.

?NOTFOUND

?NOTFOUND has a horrible stack effect. This word is not shown in any typical use examples? Is it needed? If it is needed, maybe the stack effects of the other words can be changed to make it unnecessary; although, admittedly, when I worked on combining recognizers, I did not find a solution with a nice stack flow (and I have tried). Hmm, maybe with a variant of case with a specialized variant of of?

POSTPONE

"if the exception wordset is not present". The exception wordset has been a required part of Forth200x for several years.

SET-RECOGNIZER-SEQUENCE

As specified, the sequence will always fit. Can the sequence fail to fit? If so, specify what happens.

REC-NUM

Should this be the all-singing, all-dancing variant (including doubles, number prefixes and '<char>')? Given existing practice and the legacy code base, yes. OTOH, with recognizers it seems a conceptually attractive option to have the rec-num be a decomposable sequence consisting of the various cases. But given nestable recognizer sequences, that's always an option for the future.

SCAN-TRANSLATE-STRING

This should follow C conventions for newlines like the rest of the string syntax, i.e., escape newlines with \. If other conventions are desired (e.g. what may or may not be JSON syntax), that would be for another recognizer and another translator.

The specification should be clear about what it does: "REFILL can be used to read in more lines" is neither here nor there.

TRANSLATE-STRING ?SCAN-STRING

What are these words good for? REC-STRING apparently does not need them.

[[

A word without interpretation nor compilation semantics?

Should we specify whether there is a postpone state, or alternatively that ]] has its own text interpreter loop? There are ways to distinguish these two kinds of implementation; does it matter? Maybe if you want to EVALUATE something in postpone state or somesuch.

]] and [[ should probably go into a separate proposal.

STATE

Changing the specification of state such that there is at least one non-zero value that does not mean "compilation state" is not an extension of the current specification of state, but a change. However, existing practice of systems which use -2 as postpone state suggests that this does not break existing code in practice. That's probably because so little existing code actually uses postpone state. With wider use of postpone state, some breakage may actually turn up.

The safe option would be to represent postpone state (if we have it at all) in a way other than through a value of STATE. E.g., have another variable POSTPONE-STATE: if it's false, then STATE determines the state; if it's true, the system is in postpone state.
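As a rough illustration of this option, state-translating could consult such a variable before looking at STATE. The sketch below is hypothetical: it assumes the translator layout of the reference implementation in [r1426], and no word in it actually enters postpone state, so the variable is set by hand for the demonstration:

```forth
: translator: ( xt-int xt-comp xt-post "<spaces>name" -- )
  create , , , ;   \ cell 0 = xt-post, cell 1 = xt-comp, cell 2 = xt-int

: postponing   ( j*x translator -- k*x )             @ execute ;
: compiling    ( j*x translator -- k*x )  cell+      @ execute ;
: interpreting ( j*x translator -- k*x )  2 cells +  @ execute ;

\ Separate flag for postpone state; STATE keeps its two-valued meaning.
variable postpone-state   false postpone-state !

: state-translating ( i*x translator -- j*x )
  postpone-state @ if postponing exit then
  state @ if compiling else interpreting then ;

\ Hypothetical translator whose actions just push a marker value.
:noname 1 ; :noname 2 ; :noname 3 ; translator: demo-translator

demo-translator state-translating .   \ prints 1
true postpone-state !
demo-translator state-translating .   \ prints 3
false postpone-state !
```

Existing code that tests STATE for non-zero keeps its meaning under this scheme, because postpone state never shows up as a value of STATE.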

In any case, if we put ]] in another proposal, that's where we should have this discussion.

BerndPaysan [r1428] 2025-02-16 23:03:46

Multiline strings

I don't think C is setting a good example. Nobody took C's syntax for proper multiline strings, not even C++. C is still an important legacy language, but COBOL also is in the top 20. You don't want to have multiline strings like COBOL.

C++11 got raw strings, and gcc supports them even in C. The syntax has a R"( as start, and a )" as end (with the option of adding more letters to disambiguate the string ending). Raw strings don't translate backslash+characters, which is often what you want, because the multiline string is actually some other programming language, and the editor is fine inserting all the characters you want there without escapes. Note that you need some way to disambiguate the string ending in a raw string, as you can't escape ".
Rust, Visual Basic (≥14), R, Ruby, and PHP strings are multiline by default (inserting newlines where the string has line breaks)
JavaScript (using template literals) and Go use ` (backtick) for multiline (raw) strings
C# uses @" to start a multiline string
SQL uses ' (single quote) for multiline strings
Java 15 has text blocks (with """ as start and end)
Python uses either """ or ''' for multiline strings

Nobody makes proper multiline strings like C. Really nobody. Not even recent C compilers, they follow C++. I'm now at item 20 of Tiobe index, and most languages nowadays have multiline strings one way or the other. Getting Emacs to recognize multiline strings was easy: Just remove the \n from the end of string pattern. Emacs likes multiline strings. JSON-variants with multiline strings are likely from developers that use Ruby or PHP. You have to deal with this sort of stuff.

The most popular option seems to be multiline strings by default, when legacy (e.g. through a C-like syntax) isn't a problem. As we are adding a new syntax for string literals, we don't need to care about backwards compatibility. One popular feature is to remove blanks from auto-indented strings, as editors indent these strings. Strictly speaking, if we support non-raw multi-line strings, we could even parse C strings, if a \ as last character is defined as “don't add a newline here” (instead of “unfinished escape sequence”).

ruv [r1429] 2025-02-17 00:37:20

Bernd writes:

If you have STATE-smart words in your system or user-defined such words, the only way to get the correct interpretation and compilation semantics involves having STATE as expected, you can't just call INTERPRETING or COMPILING on a translator or use some table index mechanism as in the Trute proposal to call the right slot.

Right.

If you don't have such things and have a Forth system where STATE-free replacement mechanisms are used for dual-semantics words (e.g. Gforth or VFX), and you don't define STATE-smart words yourself, you can actually use that API.

In Forth, you almost always have such things, because you have EVALUATE and INCLUDE-FILE, which depend on STATE.

INCLUDE-FILE translates a file, EVALUATE translates a string. In practice, it's also necessary to translate a single lexeme, or even a single semantic token (like a number, xt, nt).

Bernd writes:

We need to phase out STATE and define possible replacements before we can have a STATEless API.

Recognizers already don't depend on STATE. Only some token translators depend on STATE. But we cannot avoid them in a Forth system, and cannot eliminate STATE.

The existence of interpretation semantics and compilation semantics of Forth words is associated with two modes (states) of the Forth text interpreter: interpretation state and compilation state. The only way to essentially eliminate STATE is to eliminate one of these modes and the corresponding semantics. For example, one could remove interpretation state and interpretation semantics of words. This is possible, but the resulting language will not be backwards compatible with Standard Forth, since any parsing word must be an "immediate" word in this language.

For example, without interpretation state it's impossible to translate the following program:

: my'  ['] ' execute ;
my' my' constant mytick-xt

Changing the search order outside of definitions is also problematic:

also myvoc myword ( x ) previous  constant my-x

In this line, myword must be recognized in the modified search order. This is only possible in interpretation state, which means that the next lexeme is recognized only after the previous lexeme has been recognized and executed.

Factor is an example of a Forth-like language without interpretation state. There, ordinary words are always "compiled" (added to AST), parsing words (and syntax words) are always immediately executed. See: Factor / Syntax / Parser algorithm.

ruv [r1430] 2025-02-17 01:31:51

Anton writes

Add: STATE-TRANSLATING ( i*x translator -- j*x )

Why is this better than making translator a subtype of xt, and using EXECUTE instead of STATE-TRANSLATING?

The benefits of making translator a subtype of xt:

no need for a separate word (for word counters);
a translator can be defined as a quotation or anonymous definitions (sometimes this is very convenient);
a new translator can be simply defined using other translators;
- an example to illustrate the idea:
```
: translate-2lit ( 2*x -- 2*x | )
  >r translate-lit r> translate-lit
;
```
  in some of my implementations, for example, postpone correctly applies to a lexeme that is recognized into a qualified semantic token with this translator.
the Forth text interpreter loop can be re-used for other purposes;
- just for illustration, reuse the Forth text interpreter to count lexemes in a string:
```
: count-lexemes ( sd.string -- u )
 0 rot rot  ['] example.evaluate [: 2drop 1+ ['] noop  ;] apply-perceptor
;
s" a b c d" count-lexemes . \ prints "4"
```
  See the apply-perceptor word definition in recognizer-api-ext.fth

ruv [r1431] 2025-02-17 06:33:43

Anton writes in [r1426], 2025-02-15, in the "Discussion" sub-section:

The benefit of factoring out state-translating is that the state dependence can be confined to the place(s) that actually need state dependence: The standard Forth text interpreter (and user-defined text interpreters that are intended to work similarly). It does not infect all translators.

This seems irrelevant to the question of whether translator is a subtype of xt or not. I don't see any benefit of using state-translating against execute for the API users.

Please note that this is irrelevant to the question of whether translator is a subtype of xt or not.

For example, the translator can be 1+:

: count-lexemes ( sd.string -- u )
  0 rot rot ['] example.evaluate [: 2drop ['] 1+ ;] apply-perceptor ;

I provided above an example of a translator that is an xt, and the only difference to the API users is whether state-translating or execute is used. And the latter allows more useful use cases.

ruv [r1432] 2025-02-17 06:45:34

The above message is a draft that was sent accidentally. A better edition is below -)

Anton writes in [r1426], 2025-02-15, in the "Discussion" sub-section:

The benefit of factoring out state-translating is that the state dependence can be confined to the place(s) that actually need state dependence: The standard Forth text interpreter (and user-defined text interpreters that are intended to work similarly). It does not infect all translators.

This seems irrelevant to the question of whether translator is a subtype of xt or not. I don't see any benefit of using state-translating against execute for the API users.

For example, in the case of execute a translator can be even as simple as 1+:

: count-lexemes ( sd.string -- u )
 0 rot rot  ['] evaluate [: 2drop ['] 1+ ;] apply-perceptor
;

This translator is not infected either by state or by a dummy triple ( xt-int xt-comp xt-post ) .

BerndPaysan [r1433] 2025-02-17 15:29:33

I use recognizers for non-Forth languages. These languages are usually state-free, i.e. they are interpret- or compile-only. Using a quotation for the translator is completely sufficient. E.g. the recognizer in net2o's chat message that matches URLs has

[: rework-% $, msg-url ;]

as translator. No need to define a triple-entry translator table. And the translators are indeed all that short, and there's no reusability (a token translates 1:1 to a command plus a way to add the corresponding data). This thing used to be a bit more complex when it was still based on the Trute recognizers, because then, I always needed a table, and used only one slot of it (I ended up with the generic name-translator, and just put the xt I wanted to execute on the stack underneath, so it worked in interpretation state, but was actually compiling the message into a buffer). The text messages are parsed by standard EVALUATE, but with a language-specific recognizer stack that does not contain a single Forth recognizer.

Therefore I disagree with Anton that the current translator concept ties STATE to every translator: it's the other way round. It ties them only to full-blown Forth translators that work in a mixed interpreter/compiler language, where there is a state (and there, it is inevitable, and you can move that dispatch only around). You can define translators used by Forth with TRANSLATE:, but you can define translators used by other (single-state) languages just as ordinary xt, and with a single action for translation. There's no need for the table and dispatch if your language has no state at all, it's just EXECUTE of the one single action.
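A minimal sketch of that single-action case in standard Forth, for illustration only. The names rec-url, count-url and handle-token are hypothetical (this is not net2o's or Gforth's actual code), and the recognizer is assumed to return either its data plus a plain xt, or 0 on failure:

```forth
\ Sketch under stated assumptions: the translator is an ordinary xt.
variable url-count  0 url-count !

\ The single translation action: here it only counts the match.
: count-url ( c-addr u -- ) 2drop  1 url-count +! ;

\ Hypothetical recognizer: succeeds if the lexeme starts with "http".
: rec-url ( c-addr u -- c-addr u xt | 0 )
  dup 4 u< if  2drop 0 exit  then
  2dup drop 4 s" http" compare 0=
  if  ['] count-url  else  2drop 0  then ;

\ No table and no STATE dispatch: just EXECUTE the returned xt.
: handle-token ( c-addr u -- )
  rec-url ?dup if  execute  then ;
```

With these definitions, passing a lexeme such as "http://example.org" to handle-token increments url-count, while any other lexeme is silently ignored.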

When you want to reuse slots of system translators in Gforth (e.g. for a Color-Forth clone), you can use action-of interpreting/compiling/postponing ( translator -- xt ) to access the subfields. That's because all these access words are identical to a defer field in value-style structures.

E.g. Anton's example could be

: cf-recognizer ( <[_]>addr u -- data translator | 0 )
  sp@ fp@ {: sp' fp' :} over c@ >r 1 /string recognizer1
  dup 0= IF   rdrop  EXIT  THEN
  case r> '[' of  action-of interpreting  endof
          '_' of  action-of compiling     endof
          ']' of  action-of postponing    endof
          fp' fp! sp' sp! 2drop 0 dup
  endcase ;

and that works (the vocabularies used by the colorForth core wouldn't have any STATE in them). That way, you don't need to write your own colorForth outer interpreter; the standard Forth interpreter does it.

My design assumption was that making all new data types (recognizer sequences, translators) subtypes of xt, and therefore executable, will pay off, and it did.

AntonErtl [r1434] 2025-02-19 21:50:22

Multiline strings

Checking on Python3, I see that it uses C's syntax for strings starting with ". In particular, if you just do a newline in the middle of a string without escaping the newline, you get an error:

>>> print("abc
  File "<stdin>", line 1
    print("abc
          ^
SyntaxError: unterminated string literal (detected at line 1)

An escaped newline is ignored, and you need to write \n to get an actual newline. I expect that it's the same for most other languages you mention, because they all use a different syntax for "proper multiline strings". I have no problem with an additional recognizer for "proper multiline strings" with a distinguishable syntax (such as """); I can even live with rec-string doing the additional syntax, but I think that there might be others who will disparage it as a WIBNI or somesuch.

But I think that, for "-delimited strings, rec-string should either not do multi-line strings at all or do it the C/Python3/etc. way.

STATE-TRANSLATING

Why is this better than making translator a subtype of xt, and using EXECUTE instead of STATE-TRANSLATING?

It is better because it isolates the state-dependence in the word(s) calling state-translating rather than having it in the translator coming out of the recognizer and potentially being invoked through any execute, compile,, is or defer! in the system (with data-flow analysis necessary to reduce the number of potential invocations, and the result of that analysis probably still showing more occurrences than what searching for state-translating would otherwise give us).

It's similar to the difference between arming a bomb at the factory, or arming it only just before dropping it (which may never happen).
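The distinction can be sketched in standard Forth with a simplified model in which a translator is a three-cell table rather than an xt. This is only an illustration: the table layout, the helper words noop and lit,, and the translator translate-num are assumptions made for this sketch, not the proposal's normative definitions.

```forth
\ Sketch only: a translator modelled as a table ( int comp post ), not an xt.
: noop ;
: lit, ( n -- ) postpone literal ;

: translate: ( xt-int xt-comp xt-post "name" -- )
  create  rot ,  swap ,  , ;          \ stored in the order int, comp, post

: interpreting ( i*x translator -- j*x )  @ execute ;
: compiling    ( i*x translator -- j*x )  cell+ @ execute ;
: postponing   ( i*x translator -- j*x )  2 cells + @ execute ;

\ In this model, this is the only word that consults STATE.
: state-translating ( i*x translator -- j*x )
  state @ if  compiling  else  interpreting  then ;

\ Hypothetical translator for single-cell numbers.
' noop  ' lit,  :noname ( n -- ) lit, postpone lit, ;
translate: translate-num
```

In interpretation state, 5 translate-num state-translating leaves 5 on the stack; in compilation state the same phrase compiles 5 as a literal. Passing translate-num to execute would only run the body of a created word, so in this model an accidental execute cannot trigger state-dependent behaviour.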

Examples of translators not produced with `translate:`

The proposal states about translate:

This word is the only standard way to define a general purpose translator.

Any argument based on defining translators in other ways is therefore not in line with the proposal.

This applies to [r1432] as well as [r1433].

So the usages you show may work on some particular implementation, but may fail on a different implementation of the proposal.

And if you are willing to design an implementation for some convenient code of your interpretation-only recognizers, I am sure that you are able to design an implementation of recognizers with state-translating that's just as convenient.

ruv [r1435] 2025-02-20 16:40:03

Making translator a subtype of xt

Why is this better than making translator a subtype of xt

It is better because it isolates the state-dependence in the word(s) calling state-translating rather than having it in the translator coming out of the recognizer and potentially being invoked through any execute, compile,, is or defer! in the system (with data-flow analysis necessary to reduce the number of potential invocations, and the result of that analysis probably still showing more occurrences than what searching for state-translating would otherwise give us).

1. It does not isolate the state-dependence in the word(s) calling state-translating — because interpreting and compiling will also exhibit state-dependent behavior on some arguments. At the same time, state-translating will exhibit state-independent behavior on some arguments.

2. If the user prefers the word state-translating because it allows him to find invocations of translators in his code, he can define this word as synonym state-translating execute and use it in his code.

3. Given the choice between the ability to find invocations of translators and the set of benefits that an xt subtype provides, I would prefer the latter.

Any argument based on defining translators in other ways is therefore not in line with the proposal.

Yes, but they are aimed at changing the proposal ))

And if you are willing to design an implementation for some convenient code of your interpretation-only recognizers, I am sure that you are able to design an implementation of recognizers with state-translating that's just as convenient.

Bernd wrote: "This thing used to be a bit more complex when it was still based on the Trute recognizers, because then, I always needed a table, and used only one slot of it".

Side effects

Anton wrote:

It seems to me that the discussion about side effects should go into the non-normative rationale.

Agreed. A standard word cannot have an unspecified side effect that can be detected by a standard program. Therefore, it's sufficient to specify the allowed effects for standard recognizers and for the perceptor in the standard Forth system.

NOTFOUND

Anton wrote:

I have no preference here, but I remember that Matthias Trute presented a case for notfound, and that sounded convincing. Why do his arguments no longer hold (or did they not hold in the first place)?

In Matthias Trute's proposal I don't see any arguments why NOTFOUND is better than zero.

See my arguments why zero is better in the section "Special data object on failure considered harmful" of my comment [r1351] 2024-10-08.

FORTH-RECOGNIZE, deferred or getter and setter?

Anton wrote:

I see no benefits to having a getter and setter here. Deferred words are fine.

Anton, you wrote in comp.lang.forth on 2024-10-05: "I wish they had defined GET-BASE and SET-BASE instead of BASE".

You seem to see shortcomings of BASE. The shortcomings of the deferred word FORTH-RECOGNIZE are similar: if additional actions are needed when setting or getting the value, this is difficult to implement in a system and almost impossible in a program. And this word cannot be redefined by a program.

BerndPaysan [r1436] 2025-02-20 21:39:26

Multiline Strings

Anton, you seem to have missed the largest group that does it identically: Rust, Visual Basic (≥14), R, Ruby, and PHP. A number of languages that started with C-like strings did not continue to follow that example and then obviously needed a different syntax to stay backwards compatible (VB didn't have C-like strings to begin with and the multiline extension was compatible). If we introduce multiline strings as a new feature, we should not first copy a bad example and then add another syntax to fix that.

The primary reason why C's multiline string style is so weird is the preprocessor: the preprocessor has a rudimentary understanding of the language, and it uses \ at the end of the line to concatenate multiple lines. That makes it create single-line entries out of strings, and then it understands that all this is just one string it shouldn't look inside (C macros aren't replaced in strings).

There's absolutely no need to copy C's weird strings caused by their weird preprocessor approach into Forth.

AntonErtl [r1437] 2025-02-22 09:09:43

Checking Ruby (the only one of this bunch of languages that I have installed), I see that it indeed has strings that include newlines. So yes, there are programming languages that allow unescaped newlines in their most popular string syntax, instead of introducing a separate syntax for multi-line strings.

Why is this a mistake? The common case is that a string ends on the same line where it started. If the string terminator is missing on that line, it is often a mistake, and a friendly programming language has a syntax for single-line strings that allows catching the mistake right on that line. By contrast, in Ruby I get:

[~:155788] ruby
puts 'hello,
puts 'world'
-:2: syntax error, unexpected local variable or method, expecting end-of-input
puts 'world'

So it gives me a misleading error message on a different line from where the mistake happened, possibly several lines later. That's why many languages require either escaping the newline or a different syntax for multi-line strings.

The reason for that has nothing to do with backwards compatibility: These languages report an error if there is an unescaped newline in a string with the most popular syntax. Defining that case to do what you want rather than as an error does not break any existing, working programs.

The reason also has nothing to do with the C preprocessor. The C preprocessor has to know when something is inside or outside a string (it must not do macro expansion inside string literals), so it could just as well accept a newline inside a string.

I don't have an opinion on whether we should use an alternative delimiting syntax for multi-line strings, escape the newlines in the syntax for single-line strings, or have both options.

AntonErtl [r1438] 2025-02-22 10:04:11

Making translator a subtype of xt

Yes, having the translators not being state-dependent does not prevent people from performing state-dependent code (and it should not). But what it gives me is that if I avoid such code (and I do), I do not have to worry about state-dependence in every execute, compile,, is and defer!, only in state-translating.
Defining state-translate as an alias of execute does not help, because the state-dependence is in the words defined with translate:. Every other execute, compile,, is or defer! might still do something state-dependent because of that even if I have no other source of state-dependence in my program.
My preference is for translators without state-dependence. As for Bernd Paysan simplifying code when rewriting it, sure, that's his way. That's why I expect that, if he puts his mind to it, he will design an implementation of recognizers without state-dependent translate: children that's just as convenient.

NOTFOUND

After complaints about the proposal being too long, Matthias Trute removed lots of the rationale in one version of his proposal, including the rationale for not using 0. Of course the same person who had earlier complained about the length then complained about incomprehensibility. Anyway, you can find earlier versions of the proposal through Forth200x; there is also a link to the split-out comments there.

FORTH-RECOGNIZE, deferred or getter and setter?

For base, for the optimization I have in mind, one would have to check on every use of # whether base has changed in the meantime, and that cost would be substantial compared to the benefit of the optimization. And it's not just set-base that would avoid the problem: If base was a uvalue (not a uvarue), it would be relatively easy to eliminate the check in Gforth.

For forth-recognize, I have no such optimization in mind. If I had, it would be relatively easy to implement in Gforth without a change check for forth-recognize, because forth-recognize is a deferred word, not a variable.

But even on a system where you cannot attach the optimization to the defer! method of forth-recognize, inserting a change check would be much less of a problem than for base: forth-recognize tends to be much more expensive than #, so any optimization with noticeable benefit will also reduce the cycles per invocation much more, easily amortizing the change check.

Anyway, given that nobody has proposed some actual benefit from having a getter and setter, we should follow Chuck Moore's advice here: Do not speculate. In this case, this means not introducing a getter and setter.

ruv [r1439] 2025-02-22 21:41:29

Making translator a subtype of xt

Defining state-translate as an alias of execute does not help, because the state-dependence is in the words defined with translate:. Every other execute, compile,, is or defer! might still do something state-dependent because of that even if I have no other source of state-dependence in my program.

Does this mean that, according to your idea, a Forth system is not allowed to define state-translate as an alias of execute?

Otherwise, if a Forth system is allowed to provide such an implementation, then defining state-translate as an alias of execute in your program is not distinguishable from such a system's implementation. Other parts of your program simply should not know whether state-translate is an alias of execute or not, and so should not depend on that fact.

In general, other parts can do something state-dependent regardless of whether state-translate is an alias of execute. One can write:

defer foo
: bar ... state-translate ... ;
' bar is foo \ `foo` is state-dependent now (in the general case)
: baz ... foo ... ; \ `baz` is state-dependent (in the general case)

On the other hand, if you do not have other sources of state-dependence in your program (including evaluate and include-file), and you only perform translators using state-translate, how can execute do anything state-dependent in your program other than calling something that calls state-translate?

NOTFOUND

Of course the same person who had earlier complained about the length then complained about incomprehensibility.

Just in case, it wasn't me who complained about the length ;-)

Anyway, you can find earlier versions of the proposal through Forth200x; there is also a link to the split-out comments there.

Thank you, there is a RECTYPE-NULL necessity section in the split-out comments.

In this section the author argues that RECTYPE-NULL (as opposed to 0) simplifies the implementation. But the author only considers cases when the result of recognizing is used for translation. He does not consider cases when the result of recognizing is used to obtain a semantic token itself (a number, xt, nt, etc). Thus his argument did not hold in the first place, because there is no point in simplifying a small part of a program at the expense of complicating a larger part. As I have shown, using 0 simplifies programs as a whole, and it is more consistent.
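A small sketch of the case where the recognizer result is used to obtain the token itself rather than to translate it. The names rec-digit, translate-digit and digit-or-default are hypothetical and exist only for this illustration; the translator token is assumed to be any nonzero value:

```forth
\ Sketch only: failure is reported as 0, so the result works directly as a flag.
: noop ;
' noop constant translate-digit   \ hypothetical translator token (nonzero)

\ Hypothetical recognizer for a single decimal digit.
: rec-digit ( c-addr u -- n translator | 0 )
  1 = if
    c@ [char] 0 -  dup 10 u< if  translate-digit exit  then
  then
  drop 0 ;

\ Using the recognizer only to obtain the number, with a fallback value.
: digit-or-default ( c-addr u default -- n )
  >r  rec-digit if  r> drop  else  r>  then ;
```

With a special failure token instead of 0, the caller would additionally have to compare the result against that token and drop it before taking the fallback branch.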

FORTH-RECOGNIZE, deferred or getter and setter?

Anton, I see that you consider only optimization and only in Gforth. I consider programs that extend standard Forth systems in general.

For example, if I want to implement append-perceptor ( xt-recognizer -- ) and prepend-perceptor ( xt-recognizer -- ), I may have to redefine the setter set-perceptor ( xt-recognizer -- ) and getter perceptor ( -- xt-recognizer ). This is impossible without a getter and setter.
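A minimal sketch of what redefining the setter could look like in a program. The first three definitions are stand-ins for a system-provided getter and setter (they are assumptions for this sketch, not an existing API), and the counter perceptor-changes is a hypothetical extra action:

```forth
\ Stand-ins for a system-provided getter and setter (hypothetical).
variable (perceptor)
: perceptor     ( -- xt-recognizer )  (perceptor) @ ;
: set-perceptor ( xt-recognizer -- )  (perceptor) ! ;

\ A program layers an additional action on top of the setter.
variable perceptor-changes  0 perceptor-changes !
: set-perceptor ( xt-recognizer -- )
  1 perceptor-changes +!
  set-perceptor ;   \ inside the definition, the name still finds the previous setter
```

Code compiled after the redefinition goes through the new setter. With a deferred word, changes made via is or defer! would bypass any such wrapper.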

BerndPaysan [r1440] 2025-02-23 16:28:37

The C preprocessor is by design line oriented, and can't see beyond a single line. This is unlike most modern programming languages, which aren't line oriented anymore (Fortran and COBOL e.g. are line-oriented languages, and need line continuation characters, either & at the end in FORTRAN, or '-' in column 7 in COBOL in the next line). Forth is in many respects not a line-oriented language, but it has some line-oriented limitations (e.g. with PARSE).

What we should talk about is to escape line breaks if they shouldn't go into the output and are only in the string to facilitate editability. Then you can copy-paste a C multiline string, and it also works.

And when you forget the closing quote of a string in Forth, you get weird errors, even within the same line. The way to figure out what goes wrong is by using a syntax highlighting editor that knows about strings (and whether they go multi-line).

: .error-line ( line# error# -- )
  ." error  .  ." in line"  . ; 
*the terminal*:2:23: error: Undefined word
  ." error  .  ." in >>>line"<<<  . ;

Yes, I forgot the closing quote after “error”.

BerndPaysan [r1441] 2025-02-24 00:39:15

Making translator a subtype of xt

As for Bernd Paysan simplifying code when rewriting it, sure, that's his way. That's why I expect that, if he puts his mind to it, he will design an implementation of recognizers without state-dependent translate: children that's just as convenient.

We had that before. It was less convenient.

The whole point is that the translator is or isn't state-dependent, depending on the language you are creating (if it is Forth, it is). The result of moving the state dispatch around showed that this is the position where you actually can get rid of it when your language doesn't have states. You actually don't get rid of the state-infested translator if you say “this is a table, and in order to handle what's in there in the interpreter, you need state-translating”. It's still infested with the concept of states. By putting state-translating into the interpreter, which is a reusable component (you can just replace the entire recognizer stack and read in different languages with normal words like included or evaluate), you force this concept upon all translators, whether their language has that concept or not.

We now have these direct access words (interpreting, compiling, and postponing), and their use is very limited. Two of the three serve as text for the prompt. postponing is used in postpone. And there's the possibility in Gforth, to extend these tables to further states for other languages, which reuse existing recognizers, by patching their operation into the additional field. The newly created operator is used to populate the tables, and to set the state, and that's it.

If you want to get completely rid of state in the long run, put it into the Forth-specific translators. If your modified Forth-like language doesn't need state anymore, your translators won't need it, either. And then, it's just gone.

ruv [r1442] 2025-02-26 16:36:23

Named translators

Anton wrote on 2025-02-15:

The benefit of having each translator word return a translator token is that one does not need to tick the translator words in all the recognizers. A slight improvement in writability and readability with no downside

This has several disadvantages:

- when you use a translator to translate a semantic token, you have to do it via execute (or compile a call using compile, directly);
  - e.g.: xt-translator execute;
- when a new translator is defined using other translators, you have to call them via execute (or compile,);
- if you define new translators as colon definitions (which is very convenient), these translators do translation on execution, and if standard named translators return xt on execution, this will lead to inconsistency.

On the other hand, the need to tick the translator words in recognizers is mitigated when we use a tick recognizer. Thus, instead of ['] translate-xt we can write 'translate-xt (or with back-tick in Gforth's parlance).

Gerund

Anton wrote on 2025-02-16:

It's not clear to me why the gerund form is used (INTERPRETING etc.),

I think, they are temporary quick and dirty names.

According to the naming convention of standard words, the names of these words must begin with an English verb, or be just an English verb, because they perform some actions with side effects, i.e. change some states (the reverse is not true).

BerndPaysan [r1443] 2025-03-06 01:10:47

Gerund

One reason for using this is that the usage of these words has been shown to be extremely limited (other than postponing, which is used once in postpone), and one of the remaining use cases was to print the current state as readable text in the prompt by just doing get-state id. (id. gets from the xt to the nt, and then does name>string type).

Grammar-wise, it also looks more natural to use the gerund here.
