Proposal: minimalistic core API for recognizers
This page is dedicated to discussing this specific proposal
BerndPaysan [160] minimalistic core API for recognizers (Proposal, 2020-09-06 09:40:07)
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
Problem:
The current recognizer proposal has received a fair amount of criticism. One point is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as in the original recognizer proposal. It is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned data type id is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- rectype )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] rectype-null THEN ;
then be told that this is not the right way, even though it looks like it is working.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes a string and returns a rectype plus additional data on the stack (no additional data for RECTYPE-NULL):
REC-SOMETYPE ( addr len -- i*x RECTYPE-SOMETYPE | RECTYPE-NULL )
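As an illustration of this stack effect, here is a minimal sketch of one more recognizer in the style of the reference implementation. REC-CHAR is a hypothetical name, not part of the proposal; it accepts a character literal such as 'a' and hands the character code to RECTYPE-NUM. LIT, RECTYPE-NULL and RECTYPE-NUM are repeated from the reference implementation so that the block stands alone.

```forth
\ Repeated from the reference implementation so the sketch is self-contained.
: lit, ( n -- ) postpone literal ;
: rectype-null ( state -- ) -13 throw ;
: rectype-num ( n state -- )
  case
    -1 of lit, endof
    -2 of lit, postpone lit, endof
  endcase ;

\ Hypothetical recognizer: accepts exactly three characters of the form 'c'.
: rec-char ( addr u -- char rectype-num | rectype-null )
  dup 3 = if
    over c@ [char] ' =                  \ opening quote present?
    2 pick 2 chars + c@ [char] ' = and  \ closing quote present?
    if drop char+ c@ ['] rectype-num exit then
  then
  2drop ['] rectype-null ;
```

On success the recognizer leaves the character code plus the xt of RECTYPE-NUM; on failure it leaves only the xt of RECTYPE-NULL, so the caller can always execute whatever is on top of the stack.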
XY.3 Additional usage requirements
XY.3.1 Data type id
rectype: subtype of xt, and executes with the following stack effect:
RECTYPE-SOMETYPE ( i*x state -- j*x )
state is:
- 0 for interpretation
- -1 for compilation
- -2 for POSTPONE
i*x is the additional information provided by the recognizer.
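To make the three state values concrete, here is a sketch of a rectype for double-cell numbers. RECTYPE-DNUM and 2LIT, are illustrative names whose definitions are assumptions made for this example, and the sketch relies on POSTPONE 2LITERAL behaving like POSTPONE LITERAL does in the reference implementation.

```forth
\ Assumed helper: compile a double-cell literal into the current definition.
: 2lit, ( d -- ) postpone 2literal ;

\ Illustrative rectype for double-cell numbers.
: rectype-dnum ( d state -- | d )
  case
     0 of endof                        \ interpretation: leave d on the stack
    -1 of 2lit, endof                  \ compilation: compile d as a literal
    -2 of 2lit, postpone 2lit, endof   \ POSTPONE: compile code that compiles d
  endcase ;
```

Any other state value falls through to ENDCASE, which drops it and leaves d untouched; a real implementation would probably report an error there.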
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZER ( addr len -- i*x rectype | RECTYPE-NULL ) RECOGNIZER
This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.
RECTYPE-NULL ( state -- ) RECOGNIZER
Performs -13 THROW if the exception wordset is available.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:
Defer forth-recognizer ( addr u -- i*x rectype / rectype-null )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer state @ swap execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: rectype-null ( state -- ) -13 throw ;
: rectype-nt ( nt state -- )
case
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: rectype-num ( n state -- )
case
-1 of lit, endof
-2 of lit, postpone lit, endof
endcase ;
: rec-nt ( addr u -- nt rectype-nt / rectype-null )
forth-wordlist find-name-in dup IF ['] rectype-nt ELSE drop ['] rectype-null THEN ;
: rec-num ( addr u -- n rectype-num / rectype-null )
0. 2swap >number 0= IF 2drop ['] rectype-num ELSE 2drop drop ['] rectype-null THEN ;
: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
2>r
2r@ rec-nt dup ['] rectype-null <> IF 2rdrop EXIT THEN drop
2r@ rec-num dup ['] rectype-null <> IF 2rdrop EXIT THEN drop
2r> 2drop ['] rectype-null ;
' minimal-recognizer is forth-recognizer
Testing
JennyBrien
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
I don't think so. It doesn't make much difference in application, because you (almost always?) need to consume the rec-type immediately to use whatever else might be on the stack(s). If you already know what you've got but, for example, can't remember the words to POSTPONE it, you could with an active RECTYPE do something like:
-2 RECTYPE-X
But mostly you'll have the RECTYPE sitting passively on the stack as a return for a recognizer, and I don't see a great deal of difference between:
: postponed -2 swap execute ;
and
: postponed @ execute ;
Passive rectypes are easier to use (no need to remember when to tick them) and easier to code (no need to check for a bogus mode on the stack).
Compare:
: rectype-nt ( nt state -- )
case
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
with:
: rectype: create , , , ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ; rectype: rectype-nt
BerndPaysan
One possible thing is to have an automatic postpone for literals.
: rectype-lit: ( compile-xt "name" -- )
create ,
does> @ swap
case
0 of drop endof
-1 of execute endof
-2 of dup >r execute r> compile, endof
endcase ;
' lit, rectype-lit: rectype-num
' 2lit, rectype-lit: rectype-dnum
' flit, rectype-lit: rectype-float
' slit, rectype-lit: rectype-string
This works with this method, but not with the previous way.
BerndPaysan
Furthermore, obviously anyone sane who doesn't want to be 100% minimal would instantly define
: rectype: ( xt-int xt-comp xt-post "name" -- )
create , , , does> swap 2 + cells + @ execute ;
and then define generic rectypes just like in Matthias Trute's version with rectype:
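A sketch of how such a defining word might be used, taking the argument order from the stack comment above. The number rectype is rebuilt here from three separate actions; the anonymous no-op used for interpretation is an assumption made for this example.

```forth
: lit, ( n -- ) postpone literal ;

\ The dispatcher from above: state -2, -1, 0 selects cell 0, 1, 2.
: rectype: ( xt-int xt-comp xt-post "name" -- )
  create , , , does> swap 2 + cells + @ execute ;

:noname ;                          \ interpret: leave n on the stack
' lit,                             \ compile: compile n as a literal
:noname lit, postpone lit, ;       \ postpone: compile code that compiles n
rectype: rectype-num
```

Because the three xts are stored in reverse order by `, , ,`, the postpone action ends up in the first cell, which is what the `2 +` offset in the DOES> part expects.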
JennyBrien
: rectype-lit: ( xt -- ) ['] noop swap dup >r :noname r@ compile, r> postpone literal postpone compile, postpone ; rectype: ;
not so straightforward, but possible.
ruv
Previous works
In general, I like the approach of active "rectype", i.e. when you can execute it to translate a token, so a "rectype" is a token translator: ( i*x token -- j*x ).
I described this approach in comp.lang.forth in 2018 (news:pngvcc$pta$1@gioia.aioe.org).
Bernd should also remember the comparison of version D with the Resolvers API, where I specified this approach and even provided several POCs.
and then define generic rectypes just like in Matthias Trute's version with rectype:
I also showed, just for illustration, a hybrid variant, where a "rectype" can be executed and can also be an argument of the accessors (so it is also compatible with version D, i.e. it is a "passive rectype" as JennyBrien mentioned above).
But the accessors from version D exclude some implementation approaches. Actually, these accessors are useless when the higher-level methods are provided. Getting an xt and then executing it is an extra step with no benefit in most cases. Let's provide the corresponding methods instead of the accessors.
This works with this method, but not with the previous way.
I'm not sure what you refer to, but "automatic postpone for literals" can be implemented in version D too.
: create-rectype-for-literal ( xt-compiler "name" -- )
['] noop swap dup rectype:
;
Token translator
Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
RECTYPE-SOMETYPE ( i*x state -- j*x )
By convention, the name of such a word should start with an English verb.
Concerning passing the state: in my Resolvers API, the state is passed indirectly, i.e. not via the stack. This makes combining translators easier.
E.g.:
: tt-3lit ( 3*x -- 3*x | ) >r tt-2lit r> tt-lit ;
VS
: tt-3lit-s ( 3*x state -- 3*x | ) dup >r swap >r tt-2lit-s r> r> tt-lit-s ;
Passing the state is cumbersome. Also, take into account that it is usually already kept in a variable anyway. Why do you need to pass it via the stack again and again? What is the rationale for passing it directly?
Terminology
Please stop using confusing terminology such as "data type id" (in "The core principle is still that the recognizer is not aware of state, and the returned data type id is"). This terminology is not compatible with the language of the standard. I suggested proper terminology before and have now published the proposal on forth-standard.org. Let's use it (and improve it where needed), or let's accurately define another terminology. The fact is that all the proposals about recognizers can share the same terminology.
Another example is the term "recognizer types". If a recognizer is a Forth definition having a particular behavior, then "recognizer type" is "type of a recognizer", that is, a type of a Forth definition, something like a function type. But what you actually mean is a "token descriptor", that is, a "descriptor of a token", which tells something about the corresponding token and tells nothing about the recognizers (as Forth definitions).
ruv
Advantages
A huge advantage of this approach (provided the state is passed indirectly) is that most user-defined token translators can be created far more easily than the corresponding descriptors ("rectypes"). You don't need to cope with three actions, and you don't need to cope with the state at all, since any token translator can be built from other, already defined translators!
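A minimal sketch of this composition idea, assuming a two-valued STATE (interpret or compile) and ignoring the postponing level. The names TT-LIT, TT-2LIT and TT-3LIT follow the example earlier in this comment, but the definitions of TT-LIT and TT-2LIT are assumptions made for illustration.

```forth
: lit, ( n -- ) postpone literal ;

\ Basic translator: the only word here that looks at STATE.
: tt-lit  ( x -- x | ) state @ if lit, then ;

\ Derived translators: built purely from other translators.
: tt-2lit ( x1 x2 -- x1 x2 | ) >r tt-lit r> tt-lit ;
: tt-3lit ( 3*x -- 3*x | )     >r tt-2lit r> tt-lit ;
```

Only TT-LIT reads STATE; the derived translators neither test the state nor define separate actions, which is the simplification being argued for.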
BerndPaysan
Yes, I proposed that kind of solution years ago. In effect, both ways have the same expressive power, but one does it by creation of noname words, the other by normal code. Acceptance may differ.
ruv
@JennyBrien wrote
Compare: [...] with:
: rectype: create , , , ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ; rectype: rectype-nt
(sic: the full postpone action).
This comparison is incorrect, since in the proposed API rectype: (which generates a token translator) can be defined as follows:
: rectype: ( xt-executer xt-compiler xt-postponer "name" -- )
>r >r >r : ]] case
0 of [[ r> xt, ]] endof
-1 of [[ r> xt, ]] endof
-2 of [[ r> xt, ]] endof
-22 throw
endcase [[ postpone ;
;
And you can use the same code of yours to define your rectype-nt or anything else.
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
Problem:
The current recognizer proposal has received a fair amount of criticism. One point is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as in the original recognizer proposal. It is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] rectype-null THEN ;
then you should factor the part starting with state @ out and return it as a translator:
: word-translator ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- rectype )
here place here find dup IF ['] word-translator
ELSE drop ['] notfound THEN ;
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translator | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
SOME-TRANSLATOR ( i*x -- j*x )
A translator depends on STATE to translate the given arguments:
- 0 for interpretation
- -1 for compilation
- -2 for POSTPONE
i*x is the additional information provided by the recognizer.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZER ( addr len -- i*x translator | NOTFOUND-xt ) RECOGNIZER
This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW if the exception wordset is available.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:
Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: nt-translator ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: num-translator ( n -- )
case state @
-1 of lit, endof
-2 of lit, postpone lit, endof
endcase ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] nt-translator ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] num-translator ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognizer
The different actions during interpret/compile/postpone can be factored out easily, and used by a common dispatcher:
: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
Testing
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
Problem:
The current recognizer proposal has received a fair amount of criticism. One point is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as in the original recognizer proposal. It is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: word-translator ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- ... translator )
here place here find dup IF ['] word-translator
ELSE drop ['] notfound THEN ;
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translator | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
SOME-TRANSLATOR ( i*x -- j*x )
A translator depends on STATE to translate the given arguments:
- 0 for interpretation
- -1 for compilation
- -2 for POSTPONE
i*x is the additional information provided by the recognizer.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZER ( addr len -- i*x translator | NOTFOUND-xt ) RECOGNIZER
This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW if the exception wordset is available.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:
Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: nt-translator ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: num-translator ( n -- )
case state @
-1 of lit, endof
-2 of lit, postpone lit, endof
endcase ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] nt-translator ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] num-translator ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognizer
The different actions during interpret/compile/postpone can be factored out easily, and used by a common dispatcher:
: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
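A sketch of how the dispatcher might be used to rebuild the number translator from three separate actions. TRANSLATE-NUM is an illustrative name, the anonymous no-op for interpretation is an assumption, and the sketch relies on STATE holding exactly 0, -1 or -2, as this proposal specifies.

```forth
: lit, ( n -- ) postpone literal ;

\ The dispatcher from above: STATE -2, -1, 0 selects cell 0, 1, 2.
: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

:noname ;                          \ interpret: leave n on the stack
' lit,                             \ compile: compile n as a literal
:noname lit, postpone lit, ;       \ postpone: compile code that compiles n
translator: translate-num
```

On a system where STATE may hold some other non-zero value while compiling, the index computation would need to normalize the flag first.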
Testing
BerndPaysan
Downside of using STATE right in the dispatcher: POSTPONE becomes more difficult. Instead of
: postpone ( "name" -- ) parse-name forth-recognizer -2 swap execute ; immediate
it is more convoluted
: postpone ( "name" -- )
parse-name forth-recognizer
state @ >r -2 state ! catch r> state ! throw ; immediate
How to detect [[ at the end of a postpone sequence is also not so trivial.
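The save, set and restore dance around CATCH can be factored into a helper. This is only a sketch: EXECUTE-IN-STATE is a hypothetical name, and it assumes that STATE may be stored to directly, as the convoluted definition above already does.

```forth
\ Execute xt with STATE temporarily set to new-state, restoring the
\ previous value even if xt throws.
: execute-in-state ( i*x xt new-state -- j*x )
  state @ >r  state !  catch  r> state !  throw ;

\ With this helper, the POSTPONE above would read:
\   : postpone ( "name" -- )
\     parse-name forth-recognizer -2 execute-in-state ; immediate
```

This does not remove the extra complexity compared to passing the state on the stack, but it keeps it in one place.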
ruv
Downside of using STATE right in the dispatcher: POSTPONE becomes more difficult.
It's OK. Actually, we distribute complexity among various parts. When we make one thing less complex, we make another thing more complex. But because these things occur with different frequencies (in systems, libraries, programs), the total complexity can come out lower or higher.
This approach also makes some things more complex, but I believe the total complexity decreases.
Concerning POSTPONE: I think some useful parts should be factored out.
Also, we don't need to catch the exception. Usually it's a fatal error, and the state is ambiguous in any case. QUIT resets all the internal states. Concerning programs: we need a standard way to reset the internal states of the Forth text interpreter, regardless of the Recognizers proposal.
In my "lexeme resolvers" implementation I use the concept of a postponing level that can be 0, 1, or 2, and introduce words to increment and decrement this level.
So, POSTPONE is defined as follows:
: postpone ( " name" -- ) parse-name inc-state translate-lexeme dec-state ( flag ) ?nf ; immediate
Where translate-lexeme is defined as follows:
: perceive-lexeme ( c-addr u -- k*x xt-tt | c-addr u 0 )
perceptor dup if execute then
;
: translate-lexeme ( i*x c-addr u -- j*x true | c-addr u 0 )
perceive-lexeme dup if execute true then
;
(Note that, in contrast to this proposal, resolvers return ( c-addr u 0 ) on failure.)
How to detect [[ at the end of a postpone sequence is also not so trivial.
An appropriate approach is to make the word ]] a parsing word.
: ]] ( -- )
inc-state begin
next-lexeme 2dup s" [[" equals 0= while
translate-lexeme ?nf
repeat 2drop dec-state
; immediate
So we don't have any problem detecting [[ at the end.
An advantage of the postponing level concept is that the following code works as expected:
: foo [ ]] 123 . [[ ] ; foo \ prints 123
In the message news:rdcur5$ga4$1@dont-email.me (the full message: news:rdcn35$sd2$1@dont-email.me) I showed another approach, in which a postponing action is not required at all (i.e., the -2 state in this proposal).
ruv
translator: subtype of xt, and executes with the following stack effect:
SOME-TRANSLATOR ( i*x -- j*x )
It's correct in the general case, but it makes little sense, since any definition meets this stack effect.
So I think we should distinguish the parameters of the translator itself from the effect of translating the code that is passed to the translator. Possible variants:
\ We can define 'token' data type
TRANSLATE-SOMETOKEN ( i*x token -- j*x )
\ Some hybrid variant
TRANSLATE-SOMETOKEN ( i*x token{k*x} -- j*x )
\ Only low level data types
TRANSLATE-SOMETOKEN ( i*x k*x -- j*x )
(NB: I use the conventional naming {verb}-{noun} for such words.)
It should also be noted that these x may be distributed over all the stacks: the data stack, the floating-point stack, the control-flow stack (except the token k*x, which cannot be on the control-flow stack).
BerndPaysan
Indeed, TRANSLATE-SOMETHING sounds better than SOMETHING-TRANSLATOR.
FORTH-RECOGNIZER is ok, because it's followed by EXECUTE, so this is a noun.
ruv
"FORTH-RECOGNIZER" name
I thought about the FORTH-RECOGNIZER name.
It gives a strong impression that this word is similar to FORTH-WORDLIST ( -- wid ). The problem is that it isn't.
FORTH-WORDLIST is a constant (it always returns the same value) that identifies one and the same word list among all the word lists. This word list can be included in the search order, and it can be absent from the search order.
By analogy, FORTH-RECOGNIZER should be a constant that identifies one and the same recognizer among all the recognizers. This recognizer can be included in the recognizer that is used by the Forth text interpreter, and it can be absent from it. (In accordance with the concept that a sequence of recognizers is also a recognizer.)
All of this would have to hold for the naming to be consistent. But it does not. This means that the name breaks consistency and is inappropriate for the proposed word.
FORTH-RECOGNIZER ( -- xt ) can be a word that returns the xt of the system's recognizer that is used by the Forth text interpreter by default (i.e. initially).
FORTH-RECOGNIZER is ok, because it's followed by EXECUTE, so this is a noun.
Also, it gives a strong impression that it returns a recognizer. But that's wrong. Also, its result is analyzed much more often than it's followed by EXECUTE.
Basic methods
By no means, we need
- a method that tells the Forth text interpreter to use a given recognizer.
- a method that returns the recognizer that is currently used by the Forth text interpreter,
- a method that performs the recognizer that is currently used by the Forth text interpreter
A single deferred word (a vector) X can solve it:
- set: IS X
- get: ACTION-OF X
- perform: X
But I insist that this approach limits implementations too much. A Forth system may want to perform internal actions when the recognizer used by the Forth text interpreter is switched. But it cannot do that if this recognizer is switched via the IS X method. For that reason, distinct getter and setter words are usually provided in the Standard (except for the very ancient BASE and >IN, kept for backward compatibility).
Yes, perhaps Gforth can attach additional internal actions to the IS X phrase. But we shouldn't complicate all Forth system implementations.
A possible implementation via deferred word and distinct getter and setter words:
defer perceive ( c-addr u -- k*x tt )
: perceptor ( -- xt ) action-of perceive ;
: set-perceptor ( xt -- ) is perceive ;
Perhaps more specific names are better (?):
defer perceive-lexeme ( c-addr u -- k*x tt )
: lexeme-perceptor ( -- xt ) action-of perceive-lexeme ;
: set-lexeme-perceptor ( xt -- ) is perceive-lexeme ;
ruv
Correction: please read "By anyway, we need" instead of "By no means, we need".
BerndPaysan
DEFER is a core word now, so using DEFER for such a thing is ok. We don't need a special getter and setter for everything.
The implication that FORTH-RECOGNIZER returns a recognizer (it does not, it executes one) is a valid point. A better name is needed. In the current proposal it is a VALUE and does return a recognizer. In this proposal it is a deferred word and does recognize strings. We should keep Anton's unification: a sequence of recognizers can be combined into one recognizer. Just because it now recognizes more different things, it is still a recognizer. No need to find another synonym. Takes a string, returns data plus a translator token: that is a recognizer.
Maybe RECOGNIZE-FORTH is the corresponding verb. It takes a string and recognizes it if this is valid FORTH.
ruv
DEFER is a core word now, so using DEFER for such a thing is ok.
Actually, DEFER, as well as TO, is a Core extension word, so it's optional. But that's another argument.
Back to my first argument: what do you suggest if a system needs to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter?
You may ask whether I have an example of such a requirement. Yes, I do. I want to provide a method to undo such switching in my system. It's similar to the effect of the "PREVIOUS" word on the search order. Perhaps you can suggest some solution with the deferred word?
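For illustration, a sketch of what such an undo facility could look like when the switch goes through a dedicated setter. It remembers only a single previous recognizer; PREV-PERCEPTOR and UNDO-PERCEPTOR are hypothetical names, and the remaining names follow the getter and setter sketch given earlier in this thread.

```forth
defer perceive ( c-addr u -- k*x tt )
variable prev-perceptor            \ hypothetical: last replaced recognizer

: perceptor ( -- xt ) action-of perceive ;

\ The setter is the hook where a system can do its internal bookkeeping.
: set-perceptor ( xt -- )
  perceptor prev-perceptor !       \ remember what is being replaced
  is perceive ;

: undo-perceptor ( -- ) prev-perceptor @ is perceive ;
```

With a bare `IS PERCEIVE` in user code, the bookkeeping line in the setter would never run, which is the limitation being described.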
Anton's unification: a sequence of recognizers can be combined to one recognizer.
Yes. I too said that any sequence of recognizers seq-x (from API v4) can be represented as a single recognizer : recognize-x seq-x recognize ; . So sequences are superfluous in the basic API: a Forth system doesn't need to know whether it is a sequence or not.
Maybe RECOGNIZE-FORTH is the corresponding verb. It takes a string and recognizes it if this is valid FORTH.
It's better. But it recognizes not valid FORTH, but anything that the Forth text interpreter can currently recognize (and only that).
Conceptually, this word isn't just a recognizer. There is a single special system slot for the recognizer that is used by the Forth text interpreter. We can put any recognizer into this slot. We can also perform the recognizer that is placed in this slot. So this word performs the recognizer from this slot. I am inclined to call this slot "perceptor". The word that performs the recognizer from this slot then becomes "perceive".
All recognizer names have the pattern RECOGNIZE-*. The idea is not to put this special word on a par with all other recognizers. For that, it's better to find a name that is distinct from the RECOGNIZE-SOMETHING pattern. What do you think?
ruv
Actually, DEFER, as well as TO, is a Core extension word, so it's optional. But it's another argument.
This argument is that a Forth system can be implemented as a minimal kernel plus additional libraries, and DEFER, IS, ACTION-OF can be made available via a library. But when we put a deferred word into this API, we force a system's author to put DEFER, IS, ACTION-OF into the kernel too, although they aren't actually required in the kernel. That would be too restrictive a limitation on implementations.
ruv
Locate
locate cannot work for lexemes that can be recognized (translated) according to this proposal.
ruv
The last comment was intended for the proposal of AndrewHaley; it was mistakenly placed here.
BerndPaysan
The recognizer will be optional as well. At the moment, FORTH-RECOGNIZER is proposed to be a value. That's also a CORE EXT word (as is TO).
A minimalistic system that wants to implement recognizers needs FORTH-RECOGNIZER to be a deferred word, i.e. it needs code for DODEFER. It can load the rest of the deferred word stuff later as an extension.
ruv
Certainly, recognizers are optional. I didn't mean that some required part requires an optional part. I mean that one optional part requires another complex optional part without any good reason.
Yes, a minimalistic system that wants to provide a deferred word needs only code for DODEFER. But it still makes bootstrapping of this system more complex. Hence, when we put a deferred word into the API, we make things more complex for some implementations. But we don't even have a rationale for that.
Also, with a deferred word we still don't have a solution if a system needs to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter.
BerndPaysan
CORE has only VARIABLE as an option for storing things that change. As a result, the interface to use FORTH-RECOGNIZER has to be clumsy, i.e.
forth-recognizer @ execute execute
Clumsy interfaces cannot be changed if you have better things at hand. You can probably wrap the clumsy interface, e.g.
Defer recognize-forth
addr recognize-forth Constant forth-recognizer
if you can use ADDR to access the deferred word's xt storage location. But then you have another interface, less clumsy, and only available when you have DEFER+ADDR (and ADDR is not even part of the standard).
A minimalistic API, which is what I am looking for here, is one where you don't have to document much. The less uniform an API is, the more you have to document. The uniformity here is that a recognizer is a word that has ( addr u -- i*x translator-xt ) as stack effect. And combinations of recognizers have the same effect. And the system's recognizer is just another one, which you can swap in and out. And you can define a REC-SEQUENCE, where you can manipulate the sequence, and put that into the system's recognizer.
This uniformity is broken when you don't use a deferred word for the system's recognizer: you can't just call that one as you can call the others. You need @ EXECUTE. This is clumsy.
ruv
CORE has only VARIABLE as option for storing things to change. As a result, the interface to use FORTH-RECOGNIZER has to be clumsy, i.e. forth-recognizer @ execute execute
I don't suggest using a variable in the interface; it's even worse than a defer. When a variable is used to change something, the change cannot be effectively detected. But the requirement is that a system is able to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter.
For that I would prefer to have separate words in the API: a setter, a getter and a "performer" (a word that performs the recognizer that is currently used by the Forth text interpreter).
What are your objections to having several separate words in the minimalistic API?
The uniformity here is that a recognizer is a word that has ( addr u -- i*x translator-xt ) as stack effect.
I strongly support this approach (and I myself suggested this approach too, with slightly different stack effects).
This uniformity is broken when you don't use a deferred word for the system's recognizer
It seems, the set of words like the following (the names may vary):
perceive ( c-addr u -- k*x tt )
set-perceptor ( xt -- )
perceptor ( -- xt )
doesn't break the mentioned uniformity. Please, clarify.
BerndPaysan
Using special setters and getters means you have another (special purpose) DEFER mechanism here. Of course you can implement that with
variable current-perceptor
: perceive ( addr u -- i*j token ) current-perceptor @ execute ;
: set-perceptor ( xt -- ) current-perceptor ! ;
: perceptor ( -- xt ) current-perceptor @ ;
which is probably a bit less implementation effort than DEFER, IS, and ACTION-OF. Or really?
State-Smart:
: defer Create ['] noop , does> @ execute ;
: is ' >body state @ if ]] literal ! [[ else ! then ; immediate
: action-of ' >body state @ if ]] literal @ [[ else @ then ; immediate
or with NDCS:
: defer Create ['] noop , does> @ execute ;
: is ' >body ! ; ndcs: ' >body ]] literal ! [[ ;
: action-of ' >body @ ; ndcs: ' >body ]] literal @ [[ ;
DEFER is really a lightweight way to define words that can be changed.
These three lines of code are doing more than the three lines of code you need in addition when you have your special-purpose setter and getter, but they are still one-liners.
Forthers like to reinvent the wheel. But don't overdo this.
ruv
Using special setters and getters means you have another (special purpose) DEFER mechanism here.
Not necessarily. It's up to the author/implementer. They can be just wrappers over standard DEFER, as I showed earlier. So it doesn't mean reinventing the wheel. The implementation details are just hidden.
So the arguments concerning implementation of DEFER mechanism say nothing against three separate words in the minimalistic API.
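As a minimal sketch of that point, the three words can be plain wrappers over a standard deferred word. The names follow the perceive / set-perceptor / perceptor example above; the perceptor-changes counter is a hypothetical stand-in for whatever internal action a system wants to perform when the recognizer is switched, not something from any proposal.

```forth
defer perceive ( c-addr u -- k*x tt )   \ performs the current recognizer

variable perceptor-changes              \ hypothetical bookkeeping, illustration only
0 perceptor-changes !

: set-perceptor ( xt -- )
  is perceive                           \ store the new recognizer xt
  1 perceptor-changes +! ;              \ a system-specific action on switching goes here

: perceptor ( -- xt )
  action-of perceive ;                  \ fetch the xt that perceive currently performs
```

A program only sees the three ordinary words; whether a DEFER sits underneath stays an implementation detail.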
BTW, having translators for the basic data types, the words is and action-of can be even shorter:
: is ' >body tt-lit ['] ! tt-xt ; immediate
: action-of ' >body tt-lit ['] @ tt-xt ; immediate
Well, in any case I would agree that the arguments concerning complexity are more or less weak.
A strong argument (that hasn't been commented on yet) is about additional actions that a system needs to perform in the setter. What do you think in this regard?
ruv
One more strong argument against a DEFER word in the API, and for a separate getter and setter, is the following.
With DEFER in the API, we cannot define this API over another API at all. But with a separate getter and setter (and "executer"), it's possible to define this API over some other APIs.
Example: news:rn1csa$b02$1@dont-email.me
BerndPaysan
Gforth's new header structure allows overloading TO, IS (which are essentially the same) and DEFER@, so we can use the DEFER API to access similar changeable execution patterns implemented differently. So for us, it makes sense to use these access words, regardless of how they are implemented.
Other systems may not have this capability, though the way the standard now extends TO for FVALUE and others, you need to have one way or the other to deal with that. The same holds when you have a UDEFER in your system for user-specific deferred words.
For me, it is needless clutter of the dictionary and the mental space of the programmer to add setters and getters for things where you already have a generic one. But I see the point that not every system can do this.
ruv
needless clutter of the dictionary and the mental space of the programmer
I used an approach where a defining word creates two words: a getter and a setter.
For example, after the phrase create-prop x the words x and set-x are created. I didn't notice any mental space clutter in this regard. Sometimes I redefined set-x to add additional checks or actions.
Concerning dictionary space, I don't see any problem.
But I see the point that not every system can do this.
True. And even if a system can do this, it's done in some system specific way only.
So, due to the combination of all reasons, it's better to have distinct ordinary words in the standard API.
StephenPelc
If people are interested, I can arrange a virtual meeting for recognisers. They have been workshopped at various Forth Standards meetings but little of substance has emerged so far. I would suggest that such a meeting concentrate on finding what we can agree on.
Note that Forth-200x meetings are public, and the use of real names is strongly encouraged.
ruv
If people are interested, I can arrange a virtual meeting for recognisers. ... concentrate on finding what we can agree on.
I like this idea.
If people are interested, I will prepare before the meeting a proof of concept — an implementation of Recognizer API v4, Nestable Recognizer Sequences, or some other over this API.
Perhaps, somebody could share his list of questions before the meeting. My list at GitHub.
StefanK
A small remark to the POSTPONE test.
We can factor postpone in two parts with state-execute, similar to base-execute:
: state-execute ( xt s -- ) state @ >r state ! catch r> state ! throw ;
: POSTPONE ( "name" -- ) parse-name forth-recognizer -2 state-execute ; immediate
That's not very difficult anymore.
StefanK
IMHO the idea to use a deferred forth-recognize is good and more flexible than a stack of recognizers. But the translator xt makes postpone more difficult. But we can factor postpone into two parts: one that restores the stack contents at runtime (similar to lit,) and one that does the compilation. If we use rectype, similar to the proposal of recognizers from 2018, but with lit, as the third method, we get an easy postpone and '. Here, we can reuse the compile method directly by the second factor of postpone.
variable state
: translate>interpret @ ;
: translate>compile cell+ @ ;
: translate>lit, cell+ cell+ @ ;
\ Well, its translate>*lit, in fact; i.e. regenerate ( i*x ) at runtime.
Defer forth-recognizer ( addr u -- i*x translator / notfound )
Defer perform ( i*x translator -- j*x )
: perform>interpret translate>interpret execute ;
: perform>compile translate>compile execute ;
: on -1 swap ! ;
: off 0 swap ! ;
: [ ['] perform>interpret is perform state off ; IMMEDIATE
: ] ['] perform>compile is perform state on ;
\ alternatively:
\ :noname is perform state ; dup
\ : [ ['] perform>interpret [ compile, ] off ; IMMEDIATE
\ : ] ['] perform>compile [ compile, ] on ;
\ another alternative:
\ : perform state @ IF translate>compile ELSE translate>interpret THEN execute ;
' [ execute \ initialize state and perform
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer perform
REPEAT ;
: lit, ( n -- ) lit lit , , ; \ or postpone literal
: throw-13 -13 throw ;
: translator ( xt-*lit, xt-compile xt-interpret "name" -- )
create , , , ;
' throw-13 dup dup translator notfound
' lit,
:noname ( nt -- xt-execute | xt-compile, ) dup >cfa swap immediate? IF execute ELSE compile, THEN ;
:noname ( i*x nt -- j*x ) >cfa execute ;
translator translate-nt
' lit,
' lit,
:noname ; \ noop
translator translate-const-cell
: rec-nt ( addr u -- nt translate-nt | notfound )
forth-wordlist find-name-in dup IF translate-nt ELSE drop notfound THEN ;
: rec-num ( addr u -- n translate-const-cell | notfound )
0. 2swap >number 0= IF 2drop translate-const-cell ELSE 2drop drop notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator | n num-translator | notfound )
2>r 2r@ rec-nt dup notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognizer
\ simple postpone
: postpone ( "name" -- )
parse'n'recognize
dup translate>compile >r translate>lit, execute r> compile,
; IMMEDIATE
\ postpone optimized for immediate words
: postpone ( "name" -- ) \ optimized for immediate words
parse'n'recognize \ ( i*x translator )
dup translate-nt = IF ( nt translator )
over immediate? IF drop >cfa compile, exit THEN
THEN
dup translate>compile >r translate>lit, execute r> compile,
; IMMEDIATE
: ' ( "name" -- xt ) parse'n'recognize translate-nt <> IF throw-13 THEN >cfa ; IMMEDIATE
ruv
@StefanK, thank you for your participation. But it looks like you have missed too many arguments discussed above.
For example, a deferred word forth-recognizer has a confusing name, and it cannot be acceptable in the API, since it's difficult for the Forth system to detect when its value is changed (NB: it isn't an argument in favor of "stack of recognizers").
: translator ( xt-*lit, xt-compile xt-interpret "name" -- ) create , , , ;
' lit, :noname ( nt -- xt-execute | xt-compile, ) dup >cfa swap immediate? IF execute ELSE compile, THEN ; :noname ( i*x nt -- j*x ) >cfa execute ; translator translate-nt
Also, could you please stick to a consistent and clear terminology?
In this example you create not a token translator, but a named token descriptor (and the corresponding token descriptor object). See Common terminology for recognizers (improvements and criticism are welcome).
token descriptor object: an implementation dependent data object that describes how to interpret, how to compile and how to postpone (if any) a token.
I also proposed the following naming convention for the corresponding words:
- For token translators use names in the form tt-* (the abbreviation of translate-token-*); for example, tt-lit, tt-nt.
- For token descriptors use names in the form td-* (the abbreviation of token-descriptor-*); for example, td-lit, td-nt.
The approach employed in your example to create a token descriptor can be called the "three components" approach. A significant disadvantage of this approach is that it doesn't provide a way to reuse old descriptors when you create a new descriptor. Compare this to token translators, which can be easily reused to create new token translators. For example, a token translator for a pair ( nt nt ) can be created using the token translator tt-nt for a single nt as:
: tt-2nt ( i*x nt nt -- j*x ) >r tt-nt r> tt-nt ;
To create a token descriptor td-2nt in the three components approach, you need to put in a lot more effort, and you cannot reuse the td-nt descriptor.
One possible solution is not to expose the three components approach in the API and instead provide a special method to create a descriptor from other descriptors.
For td-2nt it could look like:
tt-nt dup 2 descriptor constant tt-2nt
\ or
td{ tt-nt tt-nt }td constant tt-2nt
It seems, a user never needs to provide three components for a new descriptor since any new descriptor is always based on some already defined descriptors.
But the approach based on the token translators is far simpler.
By the way, a well known word to get xt from nt is name> ( nt -- xt ) (see Forth-83 / "C. Experimental proposal" / "Definition field address conversion operators").
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: word-translator ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- ... translator )
here place here find dup IF ['] word-translator
ELSE drop ['] notfound THEN ;
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Recognized
recognized: subtype of xt, and executes with the following stack effect:
RECOGNIZED-THING ( j*x i*x state -- k*x )
A recognized xt acts on the state passed to it on the stack
- 0 for interpretation
- -1 for compilation
- -2 for POSTPONE
i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x recognized-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the recognized xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
GET-FORTH-RECOGNIZE ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE.
REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
RECOGNIZED: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a recognized word under the name "name", which performs xt-int for state=0, xt-comp for state=-1 and xt-post for state=-2.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:
Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: nt-translator ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: num-translator ( n -- )
case state @
-1 of lit, endof
-2 of lit, postpone lit, endof
endcase ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] nt-translator ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] num-translator ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognizer
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: get-forth-recognize ( -- xt )
action-of forth-recognize ;
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
Stacks TBD.
Testing
TBD
ruv
A recognized xt acts on the state passed to it on the stack
A proper term for "recognized xt" ("recognized execution token") should be chosen. "recognized xt" means "xt that is recognized", but we don't recognize execution tokens; we recognize lexemes. This xt is just a result of recognizing a lexeme. And it should be named according to what it does, not according to who produces it.
There is no reason to pass state on the stack. We discussed that, and the reference implementation reflects that.
BerndPaysan
The STATE discussion in the 2021 workshop concluded that words or xt executed should not depend on STATE. The reference implementation needs to be adjusted.
For the name of the result values we might want to have another round of bikeshedding. In particular with more native speakers. The current wording represents the last round of bikeshedding.
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: word-translator ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- ... translator )
here place here find dup IF ['] word-translator
ELSE drop ['] notfound THEN ;
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Recognized
recognized: subtype of xt, and executes with the following stack effect:
RECOGNIZED-THING ( j*x i*x state -- k*x )
A recognized xt acts on the state passed to it on the stack
- 0 for interpretation
- -1 for compilation
- -2 for POSTPONE
i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x recognized-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the recognized xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
GET-FORTH-RECOGNIZE ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE.
REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
RECOGNIZED: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a recognized word under the name "name", which performs xt-int for state=0, xt-comp for state=-1 and xt-post for state=-2.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:
Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: recognized-nt ( nt state -- )
case
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: recognized-num ( n state -- )
case
-1 of lit, endof
-2 of lit, postpone lit, endof
endcase ;
: rec-nt ( addr u -- nt recognized-nt / notfound )
forth-wordlist find-name-in dup IF ['] recognized-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n recognized-num / notfound )
0. 2swap >number 0= IF 2drop ['] recognized-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt recognized-nt / n recognized-num / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognizer
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: get-forth-recognize ( -- xt )
action-of forth-recognize ;
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
Stacks TBD.
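Until the stack words are filled in, the following is a minimal sketch of how REC-SEQUENCE: could be implemented, assuming that NOTFOUND is an ordinary colon definition whose xt can be compared with ['] notfound, as in the reference implementation above. The data layout (count followed by the xts) and the toy recognizers rec-a and rec-b are hypothetical and for illustration only; SET-REC-SEQUENCE and GET-REC-SEQUENCE are not covered.

```forth
: notfound ( -- ) -13 throw ;            \ as in the reference implementation

: rec-sequence: ( xt1 .. xtn n "name" -- )
  create dup ,  0 ?do , loop             \ body layout: n, xtn, ..., xt1
  does> ( c-addr u a-addr -- i*x xt | notfound-xt )
    dup cell+ swap @                     ( c-addr u a-slot n )
    begin dup while
      2over 2>r 2>r                      ( c-addr u ) ( R: c-addr u a-slot n )
      2r@ drop @ execute                 ( i*x xt )   \ try this recognizer
      dup ['] notfound <> if
        2r> 2drop 2r> 2drop exit         \ success: discard saved state
      then
      drop 2r> 2r> 2swap                 ( c-addr u a-slot n )
      1- swap cell+ swap                 \ advance to the next recognizer
    repeat
    2drop 2drop ['] notfound ;

\ Hypothetical toy recognizers to show the calling order
: tr-keep ( n -- n ) ;                   \ dummy translator
: rec-a ( c-addr u -- 1 xt | notfound-xt )
  s" a" compare 0= if 1 ['] tr-keep else ['] notfound then ;
: rec-b ( c-addr u -- 2 xt | notfound-xt )
  s" b" compare 0= if 2 ['] tr-keep else ['] notfound then ;

' rec-b ' rec-a 2 rec-sequence: rec-ab   \ tries rec-a (xtn) first, then rec-b
```

The string and the loop state are parked on the return stack during each call because a successful recognizer returns a variable number of cells, which must end up directly on top of the caller's stack.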
Testing
TBD
ruv
The STATE discussion in the 2021 workshop concluded that words or xt executed should not depend on STATE.
I see the following in the report by @ulli on 2021-09-08:
Given the two variations to handle STATE (either in RECOGNIZER:'s DOES> part or in INTERPRET), yesterdays participants favoured to have the single occurrence of STATE in INTERPRET. Further investigation and model implementations will show whether on or the other is beneficial.
So it implies further investigation and model implementations.
Could someone provide a rationale in favor of passing state (better to say "mode") via the stack?
My rationale against mode on the stack is the following:
- It makes combination of token translators cumbersome. E.g. a definition
: tt-3lit ( 3*x -- 3*x | ) >r tt-2lit r> tt-lit ;
becomes far more complex.
- In most cases a program doesn't need to execute a token translator in a mode that is different from the current mode (counter examples are welcome, except postpone).
- The current mode is already held by the system anyway.
- (most importantly) It introduces unnecessary coupling between the Forth text interpreter loop and the Recognizer API. This loop does not need to know anything about modes and STATE at all. If we are replacing the system's lexeme translator (along with the system's set of token translators), we should be able to replace it along with the system's STATE (and the set of the system's modes) too. Moreover, a token translator can technically ignore the passed value and use its own set of modes. And even such a simpler mode-beyond-stack API can be implemented over one that passes mode via the stack.
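To illustrate the first point, here is a rough sketch that contrasts the two styles. It assumes a simple tt-lit that consults STATE itself, and a hypothetical variant tt-lit/m that receives the mode as a flag on the stack (nonzero meaning compilation); neither definition is taken from any proposal text, and the /m names are invented for this comparison.

```forth
\ Mode held by the system: each translator reads STATE itself.
: tt-lit   ( x -- x | )          state @ if postpone literal then ;
: tt-2lit  ( x1 x2 -- x1 x2 | )  >r tt-lit r> tt-lit ;
: tt-3lit  ( 3*x -- 3*x | )      >r tt-2lit r> tt-lit ;

\ Mode passed on the stack (hypothetical): every combination
\ has to thread the mode through by hand.
: tt-lit/m  ( x mode -- x | )    if postpone literal then ;
: tt-2lit/m ( x1 x2 mode -- x1 x2 | )
  dup >r swap >r                 ( x1 mode ) ( R: mode x2 )
  tt-lit/m  r> r>                ( ... x2 mode )
  tt-lit/m ;
: tt-3lit/m ( x1 x2 x3 mode -- 3*x | )
  dup >r swap >r                 ( x1 x2 mode ) ( R: mode x3 )
  tt-2lit/m  r> r> tt-lit/m ;
```

In interpretation state both 1 2 3 tt-3lit and 1 2 3 0 tt-3lit/m simply leave the three values on the stack; the difference is only in how much stack juggling each combined definition needs.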
On the other hand I don't think that including (mentioning) STATE in a new API is a good choice. STATE returns a read-only address, and it's provided for backward compatibility only. So a better method instead of STATE is required anyway.
Actually, the system's token translators are the only ones that depend on the system's set of modes. In most cases user-defined token translators are defined via the system's token translators (which should be standardized) and they need to know nothing about the system's set of modes, or about STATE at all. At the same time, a user is able to define their own set of recognizers and set of token translators that don't depend on the system's set of modes, but introduce their own set of modes.
So, the specification for the Recognizer API should mention neither STATE nor a set of magic values like {0, -1, -2}.
Concerning your mode -2: I believe the standard word postpone doesn't need its own mode. But in postponing mode, if any, string literals (like s" foo bar") and comments should be properly treated.
ruv
- It introduces unnecessary coupling between the Forth text interpreter loop and the Recognizer API.
See a block-based illustration of this idea in my Gist.
ruv
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- ) create , , , does> state @ 2 + cells + @ execute ;
In the reference implementation you keep the mode {0,-1,-2} in the state variable.
But it's problematic, regardless of how the mode is passed into token translators (directly via the stack, or indirectly using a dedicated method).
When interpretation state is set by [ (so state is set to 0), and then compilation state is set by ], the mode should be the same as before [. If it was -2, it should be set to -2 again. But the information that the mode was -2 is lost. So another variable should be used to keep a flag whether "POSTPONE" mode is active or not.
Actually, the mode of compilation/interpretation and "POSTPONE" mode are not mutually exclusive. They can be set independently of each other.
For example, the code:
: foo postpone bar [ postpone baz ] ;
conceptually can be pretty clearly defined (see my comment). In this fragment, for the lexeme bar "POSTPONE" mode is active in compilation state; for baz "POSTPONE" mode is active in interpretation state.
So, if "POSTPONE" mode is employed, a different variable for it should be used for this reason too.
On the other hand I'm not convinced that we need "POSTPONE" mode at all.
Except to implement the word postpone itself, where and how can this mode be used? Even for the questionable construct ]] ... [[ the "POSTPONE" mode isn't needed.
BerndPaysan
OK, the most convincing argument is that STATE can go away as a specified thing. You can use and combine system translators, and you can create table-driven translators, but STATE is an implementation detail.
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor out the part starting with state @ and return it as the translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE 2drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
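The split between a state-unaware recognizer and a state-aware translator, driven by the loop above, can be sketched as follows. This is only an illustration under stated assumptions: every demo-* name is hypothetical and not part of the proposal, the recognizer accepts only unsigned single numbers in the current BASE (decimal is assumed), and EVALUATE is used merely to give the loop an input source.

```forth
\ Sketch only: every demo-* name is hypothetical, not part of the proposal.
: demo-notfound ( -- ) -13 throw ;

\ Translator: the only place that looks at STATE.
: demo-translate-num ( n -- n | )
  state @ IF postpone literal THEN ;

\ Recognizer: classifies the lexeme and never inspects STATE.
: demo-rec-num ( addr u -- n xt-translator | xt-notfound )
  dup 0= IF 2drop ['] demo-notfound EXIT THEN   \ reject the empty string
  0. 2swap >number nip 0= IF                    \ all characters converted?
    drop ['] demo-translate-num                 \ keep the low cell as n
  ELSE
    2drop ['] demo-notfound                     \ discard the partial result
  THEN ;

\ Interpreter loop from the text, wired to the single recognizer above.
: demo-interpret ( i*x -- j*x )
  BEGIN parse-name dup WHILE demo-rec-num execute REPEAT 2drop ;

\ EVALUATE makes the string the input source, so demo-interpret parses
\ the lexemes that follow its own name.
s" demo-interpret 10 20 30" evaluate
```

After the last line the three numbers 10 20 30 are on the data stack. A lexeme that is not a number would reach demo-notfound and throw -13, which is the behavior the proposal assigns to NOTFOUND.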
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translator-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
THING-TRANSLATOR ( j*x i*x -- k*x )
A translator xt performs or compiles the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused.
REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.
TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in interpretation state.
TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in compilation state.
TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in postpone state.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component you can use to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: translate-num ( n -- )
case state @
-1 of lit, endof
endcase ;
: translate-dnum ( d -- )
\ example of a composite translator using existing translators
>r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- ) >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- ) >body cell+ @ execute ;
: translate-post ( translate-xt -- ) >body @ execute ;
Stacks TBD, copy from Trute proposal.
Testing
TBD
BerndPaysan New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as in the original recognizer proposal; it is therefore not a full proposal, but a sketch of some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor out the part starting with state @ and return it as the translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE 2drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translator-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
THING-TRANSLATOR ( j*x i*x -- k*x )
A translator xt performs or compiles the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.
TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in interpretation state.
TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in compilation state.
TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in postpone state.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component you can use to construct other translators.
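How TRANSLATE: and the three explicit dispatch words fit together can be sketched as below. This is a hedged illustration, not the proposal's implementation: all demo-* names are hypothetical stand-ins, the table layout simply follows the STATE-indexed idea, STATE is assumed to be -1 while compiling, and LITERAL is assumed not to be state-smart.

```forth
\ Sketch only: demo-* names are hypothetical stand-ins for the proposed words.
: demo-translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,                      \ body holds xt-post, xt-comp, xt-int
  does> ( i*x body -- j*x )
    state @ 2 + cells + @ execute ; \ STATE 0 -> xt-int, -1 -> xt-comp

\ Explicit dispatch that does not consult STATE at all.
: demo-translate-int  ( i*x xt -- j*x ) >body 2 cells + @ execute ;
: demo-translate-comp ( i*x xt -- j*x ) >body cell+ @ execute ;
: demo-translate-post ( i*x xt -- j*x ) >body @ execute ;

: demo-lit, ( n -- ) postpone literal ;

:noname ( n -- n ) ;                             \ interpret: leave n
' demo-lit,                                      \ compile: n as a literal
:noname ( n -- ) demo-lit, postpone demo-lit, ;  \ postpone: defer the compile
demo-translate: demo-translate-num

\ Compile 7 into a definition although STATE is 0 inside [ ... ].
: demo-seven ( -- 7 ) [ 7 ' demo-translate-num demo-translate-comp ] ;

\ Postpone 9: demo-nine, compiles the literal into whoever uses it.
: demo-nine, ( -- ) [ 9 ' demo-translate-num demo-translate-post ] ; immediate
: demo-nine ( -- 9 ) demo-nine, ;
```

Executing demo-seven and demo-nine leaves 7 and 9 respectively, which shows that the compile and postpone actions can be selected without touching STATE.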
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: translate-num ( n -- )
case state @
-1 of lit, endof
endcase ;
: translate-dnum ( d -- )
\ example of a composite translator using existing translators
>r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- ) >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- ) >body cell+ @ execute ;
: translate-post ( translate-xt -- ) >body @ execute ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ; \ count cell (empty stack) followed by size item cells
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
: recognizer-sequence: ( rec1 .. recn n "name" -- )
dup stack: dup 1+ cells negate here + set-stack \ stack-id is the count cell, n+1 cells below HERE
DOES> recognize ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) >body get-stack ;
Testing
TBD
BerndPaysan New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as in the original recognizer proposal; it is therefore not a full proposal, but a sketch of some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor out the part starting with state @ and return it as the translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE 2drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translator-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt performs or compiles the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.
TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in interpretation state.
TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in compilation state.
TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in postpone state.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component you can use to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: translate-num ( n -- )
case state @
-1 of lit, endof
endcase ;
: translate-dnum ( d -- )
\ example of a composite translator using existing translators
>r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- ) >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- ) >body cell+ @ execute ;
: translate-post ( translate-xt -- ) >body @ execute ;
Defining translators
Once you have TRANSLATE: and the associated invocation tools, you should define the translators using it:
: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you should define the default recognizer with them:
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
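The order documented for RECOGNIZER-SEQUENCE: (try xtn first, then proceed towards xt1) can be illustrated with a small standalone sketch. It is not the reference implementation and does not use the stack library; all demo-* names are hypothetical, and the character-literal recognizer is only an example of a second recognizer to chain.

```forth
\ Sketch only: demo-* names are hypothetical, not part of the proposal.
: demo-notfound ( -- ) -13 throw ;
: demo-translate-num ( n -- n | ) state @ IF postpone literal THEN ;

: demo-rec-num ( addr u -- n xt | xt-notfound )
  dup 0= IF 2drop ['] demo-notfound EXIT THEN
  0. 2swap >number nip 0= IF drop ['] demo-translate-num
  ELSE 2drop ['] demo-notfound THEN ;

\ Character literals of the form 'c' reuse the number translator.
: demo-rec-char ( addr u -- c xt | xt-notfound )
  3 = IF
    dup c@ [char] ' = over 2 chars + c@ [char] ' = and IF
      char+ c@ ['] demo-translate-num EXIT
    THEN
  THEN
  drop ['] demo-notfound ;

\ Try n recognizers stored at slot, slot+cell, ... until one succeeds.
: demo-recognize ( addr u slot n -- i*x xt | xt-notfound )
  dup 0= IF 2drop 2drop ['] demo-notfound EXIT THEN
  2over 2>r 2dup 2>r             \ save addr u and slot n
  drop @ execute                 ( i*x xt | xt-notfound )
  dup ['] demo-notfound = IF
    drop 2r> 2r> 2swap           ( addr u slot n )
    1- swap cell+ swap recurse   \ retry with the next slot
  ELSE
    2r> 2drop 2r> 2drop          \ success: discard the saved copies
  THEN ;

: demo-sequence: ( xt1 .. xtn n "name" -- )
  create dup , 0 ?DO , LOOP      \ xtn lands in the first slot
  does> ( addr u body -- i*x xt | xt-notfound )
    dup cell+ swap @ demo-recognize ;

' demo-rec-num ' demo-rec-char 2 demo-sequence: demo-recognizer
```

With this definition, demo-recognizer first asks demo-rec-char and only then demo-rec-num, and it can itself be placed into another sequence because it has the stack effect of a recognizer.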
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
' root ' forth ' forth 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
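On a system where a wid is an opaque value rather than an xt, a similar effect can be approximated by wrapping the standard SEARCH-WORDLIST in a recognizer. The following is only a sketch under that assumption; the demo-* names and the wordlist demo-wl are hypothetical, and it returns an xt plus flag instead of an nt.

```forth
\ Sketch only: demo-* names are hypothetical, not part of the proposal.
: demo-notfound ( -- ) -13 throw ;

\ flag is 1 for immediate words and -1 otherwise, as left by SEARCH-WORDLIST.
: demo-translate-xt ( i*x xt flag -- j*x )
  0< state @ and IF compile, ELSE execute THEN ;

wordlist constant demo-wl

\ A recognizer that searches exactly one wordlist.
: demo-rec-wl ( addr u -- xt flag xt-translator | xt-notfound )
  demo-wl search-wordlist dup IF
    ['] demo-translate-xt
  ELSE
    drop ['] demo-notfound
  THEN ;

\ Put one word into demo-wl without adding it to the search order.
get-current demo-wl set-current
: demo-answer ( -- 42 ) 42 ;
set-current
```

Interpreting the phrase s" demo-answer" demo-rec-wl execute then runs demo-answer even though demo-wl is not in the search order, which is the property that lets a sequence of such recognizers stand in for the search order.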
Testing
TBD
BerndPaysan New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as in the original recognizer proposal; it is therefore not a full proposal, but a sketch of some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor out the part starting with state @ and return it as the translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE 2drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translator-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt performs or compiles the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.
TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in interpretation state.
TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in compilation state.
TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in postpone state.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component that can be used to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component that can be used to construct other translators.
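As an illustration of TRANSLATE:, the following sketch defines a translator for double-cell numbers. It copies TRANSLATE: from the reference implementation of this proposal, which assumes that STATE holds 0, -1 or -2 for interpretation, compilation and postpone state. The names 2LIT, and TRANSLATE-DBL are made up for this example, NOOP is defined only if the system lacks it, and 2LITERAL requires the Double-Number word set.

```forth
\ TRANSLATE: as in the reference implementation (assumes STATE is 0/-1/-2)
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> ( j*x i*x body -- k*x ) state @ 2 + cells + @ execute ;

\ NOOP is not a standard word; define it if the system does not have it
[undefined] noop [if] : noop ; [then]

\ hypothetical helper: compile a double-cell literal
: 2lit, ( d -- ) postpone 2literal ;

' noop                           \ interpretation: leave d on the stack
' 2lit,                          \ compilation: compile d as a literal
:noname 2lit, postpone 2lit, ;   \ postpone: compile code that compiles d
translate: translate-dbl ( d -- | d )
```

In interpretation state, `1234. translate-dbl` leaves the double-cell number on the stack unchanged.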
Reference implementation:
This is a minimal core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpretation and compilation state into account and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: translate-num ( n -- )
case state @
-1 of lit, endof
endcase ;
: translate-dnum ( d -- )
\ example of a composite translator using existing translators
>r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- ) >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- ) >body cell+ @ execute ;
: translate-post ( translate-xt -- ) >body @ execute ;
Defining translators
Once you have TRANSLATE: and the associated invocation tools, you shall define the translators using it:
: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
' root ' forth ' forth 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
Problem:
The current recognizer proposal has received a fair amount of criticism. One point is that its API is too big. This proposal therefore tries to define a very minimal API for a core recognizer and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original one.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor out the part starting with state @ and return it as the translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt performs or compiles the action of the thing according to the state the system is in. i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
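To make the contract between recognizer and translator concrete, here is a minimal sketch of a user-defined recognizer for character lexemes of the form 'c'. REC-CHAR is a hypothetical name, and NOTFOUND and TRANSLATE-NUM are simplified stand-ins (using STATE directly) for the words a recognizer-enabled system would provide. Here i*x is the character code returned by the recognizer.

```forth
\ Simplified stand-ins for the system's words (assumption: STATE-based)
: notfound ( -- ) -13 throw ;
: translate-num ( n -- n | )
  state @ IF postpone literal THEN ;

\ hypothetical recognizer: accepts exactly 'c' (quote, char, quote)
: rec-char ( addr u -- c translate-xt | notfound-xt )
  3 = IF
    dup c@ [char] ' =  over 2 + c@ [char] ' =  and IF
      1+ c@ ['] translate-num EXIT   \ success: char code and translator
    THEN
  THEN
  drop ['] notfound ;                \ anything else is not recognized
```

The recognizer itself never looks at STATE; only the translator it returns does. In interpretation state, `s" 'a'" rec-char execute` leaves the character code 97 on the stack.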
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best-effort approach to display an adequate error message.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.
TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in interpretation state.
TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in compilation state.
TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in postpone state.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component that can be used to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component that can be used to construct other translators.
Reference implementation:
This is a minimal core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpretation and compilation state into account and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: translate-num ( n -- )
case state @
-1 of lit, endof
endcase ;
: translate-dnum ( d -- )
\ example of a composite translator using existing translators
>r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- ) >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- ) >body cell+ @ execute ;
: translate-post ( translate-xt -- ) >body @ execute ;
Defining translators
Once you have TRANSLATE: and the associated invocation tools, you shall define the translators using it:
: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
AntonErtl
It seems to me that, given the reference implementation
' translate-nt translate-int
' translate-num translate-int
' translate-dnum translate-int
does not work (nor does it work with translate-comp or translate-post). Assuming you solve this, do you really want me to define, e.g.,
:noname ['] translate-nt translate-int ;
to get an xt equivalent to one of the xts that have been passed to translate:?
How do you implement POSTPONE (IIRC Matthias Trute has a reference implementation for that)?
What problem is solved by making all the translators state-smart? The problem I see is that you can only access the individual actions by saving state, setting state, executing the translator, and restoring the state. That's not a good design.
The specification of translate: mentions a "current mode". Where do I find out what a "mode" is? This is non-standard terminology.
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
Problem:
The current recognizer proposal has received a fair amount of criticism. One point is that its API is too big. This proposal therefore tries to define a very minimal API for a core recognizer and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original one.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor out the part starting with state @ and return it as the translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in. i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best-effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT
Get the xt that translates as in interpretation state.
COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT
Get the xt that translates as in compilation state.
POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT
Get the xt that translates as in postpone state.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component that can be used to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component that can be used to construct other translators.
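One use of the FORTH-RECOGNIZER / SET-FORTH-RECOGNIZE pair is to install a recognizer temporarily and restore the previous one afterwards. The following is only a sketch under assumptions: the first four definitions repeat the reference implementation so that the block is self-contained (on a recognizer-enabled system they come from the system), REC-NOTHING and WITH-RECOGNIZER are hypothetical names, and CATCH/THROW require the Exception word set.

```forth
\ Repeated from the reference implementation for self-containment
Defer forth-recognize ( addr u -- i*x translator-xt | notfound-xt )
: notfound ( -- ) -13 throw ;
: set-forth-recognize ( xt -- ) is forth-recognize ;
: forth-recognizer ( -- xt ) action-of forth-recognize ;

\ hypothetical recognizer that rejects every lexeme
: rec-nothing ( addr u -- notfound-xt ) 2drop ['] notfound ;
' rec-nothing set-forth-recognize

\ hypothetical helper: execute xt with xt-rec installed, then restore
: with-recognizer ( i*x xt-rec xt -- j*x )
  forth-recognizer >r          \ remember the current recognizer
  swap set-forth-recognize     \ install the temporary one
  catch                        \ run xt, trapping any THROW
  r> set-forth-recognize       \ always restore the old recognizer
  throw ;                      \ re-raise a trapped exception, if any
```

Because the old xt is fetched with FORTH-RECOGNIZER rather than ACTION-OF, the helper does not depend on FORTH-RECOGNIZE being a deferred word.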
Reference implementation:
This is a minimal core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpretation and compilation state into account and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;
Defining translators
Once you have TRANSLATE: and the associated invocation tools, you shall define the translators using it:
: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
Problem:
The current recognizer proposal has received a fair amount of criticism. One point is that its API is too big. This proposal therefore tries to define a very minimal API for a core recognizer and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original one.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor out the part starting with state @ and return it as the translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in. i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best-effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current mode.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT
Get the interpreter xt from the translator.
COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT
Get the compiler xt from the translator.
POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT
Get the postpone xt from the translator.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component that can be used to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component that can be used to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpretation and compilation state into account, and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
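The stack library above uses CELL and BOUNDS, which are Gforth words rather than standard ones. A minimal sketch of the same three words restricted to standard words, followed by a round trip, might look like this; the name demo-stack and the definition of bounds given here are assumptions made for the example.

```forth
\ BOUNDS is not a standard word; this definition is assumed for the sketch.
: bounds ( addr u -- addr+u addr ) over + swap ;

: stack: ( size "name" -- )
  create 0 , cells allot ;        \ count cell followed by SIZE item cells
: set-stack ( item-n .. item-1 n stack-id -- )
  2dup ! cell+ swap cells bounds  \ store the count, then loop over the slots
  ?do i ! 1 cells +loop ;         \ item-1 lands in the first slot
: get-stack ( stack-id -- item-n .. item-1 n )
  dup @ >r r@ cells + r@          \ start at the last used slot
  begin ?dup while
    1- over @ rot 1 cells - rot   \ fetch an item, step one slot down
  repeat
  drop r> ;

4 stack: demo-stack
10 20 30 3 demo-stack set-stack
demo-stack get-stack              \ leaves 10 20 30 3
```

GET-STACK returns the items in the same order in which SET-STACK received them, which is what lets GET-RECOGNIZER-SEQUENCE and SET-RECOGNIZER-SEQUENCE be used as a matching pair.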
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
BerndPaysan New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original one.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
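To see the loop at work without touching the system's own text interpreter, it can be driven through EVALUATE. The sketch below is a simplified stand-in, not the reference implementation: it recognizes only unsigned numbers in the current base, uses a STATE-based translator for brevity, and the name my-interpret is chosen to avoid clashing with an existing INTERPRET.

```forth
: notfound ( -- ) -13 throw ;

\ Simplified translator: leave n when interpreting, compile it otherwise.
: translate-num ( n -- n | ) state @ if postpone literal then ;

: rec-num ( addr u -- n translate-xt | notfound-xt )
  0 0 2swap >number 0= if
    2drop ['] translate-num       \ whole lexeme converted: keep the low cell
  else
    2drop drop ['] notfound       \ unconverted characters remain
  then ;

defer forth-recognize
' rec-num is forth-recognize

: my-interpret ( i*x -- j*x )
  begin parse-name dup while forth-recognize execute repeat
  2drop ;

\ The system interpreter executes MY-INTERPRET, which then consumes
\ the rest of the evaluated string through FORTH-RECOGNIZE.
s" my-interpret 12 34" evaluate   \ leaves 12 34
```

A lexeme that is not a number would make rec-num return the xt of notfound, so the loop would end with -13 THROW.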
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
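As a small illustration of this stack effect, here is a sketch of a recognizer for character lexemes of the form 'c'. The names rec-char and translate-char are hypothetical, and the translator is a simplified STATE-based one that handles only interpretation and compilation.

```forth
: notfound ( -- ) -13 throw ;

\ Simplified, STATE-based translator for a character value (hypothetical).
: translate-char ( c -- c | ) state @ if postpone literal then ;

\ Hypothetical recognizer for lexemes of the form 'c'.
: rec-char ( addr len -- c translate-xt | notfound-xt )
  dup 3 = if                              \ exactly three characters?
    over c@ [char] ' = if                 \ opening quote
      over 2 chars + c@ [char] ' = if     \ closing quote
        drop char+ c@ ['] translate-char exit
      then
    then
  then
  2drop ['] notfound ;
```

The recognizer itself never looks at STATE; it only classifies the lexeme and hands the decision about what to do with the character to the translator it returns.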
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in. i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state, and xt-post in postpone state, using a system-specific way to determine the current state.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current recognizer instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT
Get the interpreter xt from the translator.
COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT
Get the compiler xt from the translator.
POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT
Get the postpone xt from the translator.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component that can be used to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component that can be used to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpretation and compilation state into account, and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
BerndPaysan New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
- 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original one.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in. i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state, and xt-post in postpone state, using a system-specific way to determine the current state.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current recognizer instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT
Get the interpreter xt from the translator.
COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT
Get the compiler xt from the translator.
POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT
Get the postpone xt from the translator.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component that can be used to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component that can be used to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpretation and compilation state into account, and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
BerndPaysan
I removed the access words to the xts for a reason: We don't do it that way in Gforth, and we actually found little use of those words.
There are (at least) three ways to create an interface to these operations:
- Field-like, i.e. you can read and write the xts for interpretation/compilation/postpone. The typical usage is ( translator ) TRANSLATE-<state> @ EXECUTE.
- Valuefield-like, as in the original Trute proposal. You can read the xts for interpretation/compilation/postpone without an extra @, but you can't write them, unless your access word also implements TO. The typical usage is ( translator ) TRANSLATE-<state> EXECUTE.
- Deferfield-like, which is what Gforth does. Here, not only the @ is part of the operation, but also the EXECUTE (really a tail-call variant of it). You can neither read nor write the xts, unless the access words also implement IS and ACTION-OF. The typical usage is ( translator ) TRANSLATE-<state>, and that looks about right. Gforth uses different names to not collide with the proposal here.
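The difference between the three styles can be sketched for the postpone slot as follows. This is only an illustration: the word names are hypothetical, the three actions are stand-ins that return a number, and the table is laid out as in the reference implementation (postpone, compile, interpret).

```forth
\ Stand-ins for the three actions of a translator (hypothetical).
: act-int  ( -- n ) 1 ;
: act-comp ( -- n ) 2 ;
: act-post ( -- n ) 3 ;

\ A translator table laid out as in the reference implementation:
\ postpone, compile, interpret.
create demo-translator  ' act-post , ' act-comp , ' act-int ,

\ 1. Field-like: yields the address of the slot; the caller does @ EXECUTE.
: postpone-field ( translator-xt -- addr ) >body ;
\ 2. Valuefield-like: yields the xt; the caller does EXECUTE.
: postpone-value ( translator-xt -- xt ) >body @ ;
\ 3. Deferfield-like: fetches and executes in one step.
: postpone-defer ( i*x translator-xt -- j*x ) >body @ execute ;

' demo-translator postpone-field @ execute   \ leaves 3
' demo-translator postpone-value execute     \ leaves 3
' demo-translator postpone-defer             \ leaves 3
```

Only the field-like variant exposes an address that could be written to; the other two would need TO or IS support to change a slot.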
As an extension, Gforth allows adding more states and thus more access words. That extension also adds IS, TO (which are synonyms) and ACTION-OF to the existing access word (there is only one, because only the postpone state needs an explicit access word), and also implements the other two access words for interpret and compile, which are never used on their own. Of course, when you add a new state, you need to specify what existing translators do in that state, so IS becomes necessary, and ACTION-OF just comes for free through Gforth's way of implementing TO and its variants, of which ACTION-OF is one. This extension is non-standard and not proposed here; it is used for creating obscured (“tokenized”) source code and reading name=value-style config files.
The experience so far is that outside of this extension, only one of those three access words is needed at all, namely TRANSLATE-POSTPONE, and it is needed exclusively inside the standard word POSTPONE itself, a word whose implementation is left up to the system anyway. So the usage of these words is extremely limited. Therefore, I deleted them and suggest not standardizing these words, following the “don't speculate” rule and the aim of this proposal to define a minimalistic API that contains only what's necessary. These words are of little use, and therefore there's no need to standardize them.
BerndPaysan New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
- 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
- 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original one.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unsure what to do on postpone at this stage, use -48 throw; otherwise define a postpone action:
:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF compile, ELSE execute THEN ;
:noname ( xt flag -- ) 0< IF postpone literal postpone compile, ELSE compile, THEN ;
translate: translate-xt
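To try this translator on a system that does not yet provide TRANSLATE:, the table-based definition from the reference implementation can be used. The sketch below rests on that definition and therefore assumes that STATE holds 0 or -1 and that only interpretation and compilation state are dispatched; the words five, [five] and use-five are hypothetical test words.

```forth
\ TRANSLATE: as in the reference implementation (table order: post, comp, int).
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> ( i*x addr -- j*x ) state @ 2 + cells + @ execute ;

:noname ( xt flag -- ) drop execute ;                       \ interpret
:noname ( xt flag -- ) 0< if compile, else execute then ;   \ compile
:noname ( xt flag -- )                                      \ postpone
  0< if postpone literal postpone compile, else compile, then ;
translate: translate-xt

: five ( -- 5 ) 5 ;               \ hypothetical test word
' five -1 translate-xt            \ interpretation state: executes FIVE

\ An immediate helper runs the translator while the system is compiling.
: [five] ( -- ) ['] five -1 translate-xt ; immediate
: use-five ( -- 5 ) [five] ;      \ FIVE is compiled into USE-FIVE
use-five
```

The flag -1 mimics what FIND returns for a non-immediate word, so the compile action compiles the xt instead of executing it.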
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in. i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state, and xt-post in postpone state, using a system-specific way to determine the current state.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current recognizer instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpretation and compilation state into account, and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
AntonErtl
Making the translators depend on state is a bad idea. It means that everything using the translators becomes infected with this state-dependency. It also means that you cannot implement postpone or ]]...[[ as standard-compliant code (while, with state-independent translators you could).
Moreover, when you write a state-independent text interpreter, such as a polyForth-style text interpreter, or colorforth-bw, you would have to set state before executing the translators, which is perverse. And in the case of colorforth-bw, again there is no standard way to set state to get the translator to perform xt-post.
BerndPaysan
The experience with the usage in Gforth (non-standard extensions excluded) shows that direct calls to translators with a specific state are limited to postpone, which is compile-only and therefore
: postpone ( "name" -- )
-2 state ! parse-name forth-recognize execute -1 state ! ; immediate compile-only
does not cause surprises (postpone is expected to leave the system in compilation state after it has done its work). In Gforth, ]] and [[ are implemented by changing state, and for recognizing the super-immediate [[ a special recognizer is added to the stack which returns a translator that has a specific postpone effect that changes back to compilation state and drops the additional recognizer from the stack.
' noop dup :noname ] forth-recognizer stack> drop ; translate: translate-[[
The state-dependent invocation is the 99.9% case for translators, and that includes ]] and [[.
The Forth outer interpreter depends on state (or a similar internal representation). The object that deals with the different actions depending on state is the translator.
The proposal allows you to implement other ways to access the individual methods of a translator, if you need them. It no longer encourages using translators as building blocks for other translators, and we can add wording that only translators created by translate: are standard-conforming. Since there's little use for these other access methods, it does not suggest standardizing them.
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
- 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
- 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
- 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows implementing fancier features as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone in this stage, use -48 throw, otherwise define a postpone action:
:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF compile, ELSE execute THEN ;
:noname ( xt flag -- ) 0< IF postpone literal postpone compile, ELSE compile, THEN ;
translate: translate-xt
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a translator.
"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.
Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE-independently, which only works on translators created by TRANSLATE: (e.g. for implementing POSTPONE), so any other way to define a translator is non-standard.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
ruv
Anton writes:
Making the translators depend on state is a bad idea.
It's a matter of terminology. A translator depends on state by definition:
to translate a token: to interpret the token if interpreting, or to compile the token if compiling.
token translator: a Forth definition that translates a token; also, depending on context, the execution token for this Forth definition.
If you want a recognizer to return not an execution token, but some opaque identifier, I suggest to call it "descriptor".
token descriptor object: an implementation dependent data object (a set of information) that describes how to interpret and how to compile a token.
token descriptor: a value that identifies a token descriptor object; also, less formally and depending on context, a Forth definition that just returns this value (i.e., a constant), or a token descriptor object itself.
BerndPaysan
As said, this is just about moving things around. There's little difference if you use translator-execute or execute on a translator as the specified way to get from the translator to its state-dependent action. It's all system-dependent and hidden, and systems might implement it without even referring to STATE and only update STATE to reflect compilation and interpretation state and otherwise never look at it, and the way the system internally keeps its state can be completely different. The most obvious difference is that with translator-execute, you need another word.
The fact that some abstract data type is executable does not mean EXECUTE is the only way to operate on it. Recognizer sequences are executable in this proposal, and they still can be read out with GET-RECOGNIZER-SEQUENCE and set by SET-RECOGNIZER-SEQUENCE. So you can't just define them as colon definitions, you need to go through RECOGNIZER-SEQUENCE: to define them.
Though I don't propose to standardize this, the proposal also suggests to make word list ids executable, and put them together in a recognizer sequence called search-order. Word list ids still have to be used in other ways, e.g. to add new words to them, and the details are left to the system; but it is clear that they can't be normal colon definitions.
Not providing an abstraction like either translator-execute or execute, and instead putting it directly as state @ abs cells + @ execute into the outer interpreter is a really bad idea, because all details of the reference implementation in which this sequence works then become part of the standard. Other ways to implement it, which may have performance advantages, or not expose the postpone state in STATE would then not be allowed. The following implementation should be standard, too:
: do-translate ( translate-body -- ) 0 + @ execute ;
: state! ( state -- ) dup state ! abs cells ['] do-translate >body cell+ ! ; \ assume threaded code
: translate: ( int-xt comp-xt post-xt "name" -- )
create swap rot , , ,
does> do-translate ;
: [ 0 state! ; immediate
: ] -1 state! ;
: ]] 2 cells ['] do-translate >body cell+ ! ; immediate \ STATE left as is
How to recognize [[ is left as exercise to the reader, hint: a recognizer is a good idea, because it actually provides something that is executed at postpone time.
AntonErtl
By comparison, with the first version of this proposal postpone can be implemented like this:
: postpone parse-name forth-recognize -2 swap execute ; immediate
which would not contain non-standard usage like -2 state !, and it would also work in interpret state (not the most important feature, but a feature nonetheless). And ]] could also be implemented as a standard program.
I don't want to restrict the usage of rectypes/translators to state-dependent outer interpreters. Other uses may be rare, but they exist, and people may come up with more over time if we make the interface flexible enough. The proposal does not propose to standardize state-independent ways to get at the functionality. Therefore, if the proposal is accepted, they don't exist for standard programs, and therefore they are not counterarguments against the disadvantages of the proposed state-dependent-only translators. The fact that this state-dependence means that you cannot use rectypes/translators to build other rectypes/translators is another (minor) argument against the state-dependence.
Concerning having a state-independent rectype as an abstract data type, the first version of this proposal proposed that rectype is an executable word with stack effect ( i*x state -- j*x ) where state would be 0, -1, or -2. This does not expose anything about the internals, and even allows defining rectypes without using a special defining word. The invocation in the text interpreter is ( i*x rectype ) state @ swap execute, and in postpone it's as shown above.
Alternatively, if the rectype is the address of some data structure, yes, we would need an additional word, maybe rectype-translate ( i*x rectype n -- j*x ) that performs the access to the data structure. The usage in the text interpreter would be ( i*x rectype ) state @ rectype-translate and the usage in postpone would be ( i*x rectype ) -2 rectype-translate.
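To make the first of these two interfaces concrete, here is a minimal sketch of a number rectype with stack effect ( n state -- ... ), written as an ordinary colon definition. The name rectype-num and the choice of -24 throw for an unknown state value are assumptions made for illustration; the sketch also assumes a system where STATE holds -1 while compiling.

```forth
: lit, ( n -- ) postpone literal ;

\ Hypothetical rectype for single-cell numbers; the caller passes the state.
: rectype-num ( n state -- n | )
  CASE
     0 OF                    ENDOF   \ interpret: n stays on the stack
    -1 OF lit,               ENDOF   \ compile: compile n as a literal
    -2 OF lit, postpone lit, ENDOF   \ postpone: compile code that compiles n
    -24 throw                        \ any other value: invalid numeric argument
  ENDCASE ;

\ Text-interpreter style invocation: ( i*x rectype ) state @ swap execute
42 ' rectype-num state @ swap execute .   \ interpreting: prints 42

\ The same invocation while compiling, driven by an immediate helper
: c42 ( -- ) 42 ['] rectype-num state @ swap execute ; immediate
: answer ( -- n ) c42 ;
answer .                                  \ prints 42
```

No defining word is involved here, which is the property claimed for this interface; the price is that the numeric state values become part of the interface.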
ruv
Bernd writes:
The most obvious difference is that with translator-execute, you need another word.
Yes, essentially I agree concerning translator-execute and execute alternatives.
Yet another difference is that with translator-execute the Forth text interpreter (the outer loop) should know this additional word (probably it means more degree of coupling). But with execute, it does not need to know any additional word.
to make word list ids executable [...] but it is clear that they can't be normal colon definitions
Another example is defer-words (words created by defer), which are executable but are not normal colon definitions: defer! and defer@ can be applied to their xt.
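A short illustration of that point, using only the standard words DEFER, DEFER! and DEFER@; the names greet and hello are made up for the example:

```forth
defer greet
: hello ( -- ) ." hello" ;

' hello ' greet defer!        \ same effect as: ' hello is greet
greet                         \ executing the deferred word prints hello
' greet defer@ ' hello = .    \ its xt can also be read out: prints -1
```

The xt of greet can be executed like any other xt, yet it also carries data that other words operate on, which is the same situation as with translators and recognizer sequences.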
The following implementation should be standard, too
The provided implementation for ]] is system dependent, namely it depends on the implementation of the Recognizers API.
But, anyway, Gforth's ]] can be implemented in a standard way via postpone.
BerndPaysan
A translator is the address of a data structure, which also happens to be executable. This is not a contradiction! And there was a proposed standard way to access fields directly, renamed from the Trute proposal (but with otherwise identical, value-field like semantics) to INTERPRET-TRANSLATOR, COMPILE-TRANSLATOR, and POSTPONE-TRANSLATOR. The reason I deleted these is that we don't even use them in Gforth, we only use >POSTPONE, which has a different effect (it does not read out the xts, it executes it right away). If there is consensus that this is the right interface (not a value-field, but a defer-field), I can add this back to the proposal; as well as adding a standard way to set the state without knowing the internals of the system, for which the file recognizer-ext.fs in Gforth also provides a suggestion:
: translate-state ( translator-access-xt -- )
\ takes a translator access xt, and may check if that actually is one
>body @ cell/ negate state ! ;
The hypothetical more performant implementation in Reply 1043 would have a different translate-state, which would contain something like
>body @ ['] do-translate >body cell+ !
and only change STATE for interpret/compile.
This proposal is minimalistic on purpose and does not cover all corner cases, especially not those where no consensus has been reached yet.
I consider the magic number dispatch method proposed earlier as not appropriate: this is tied to a specific implementation, and not a good interface. Method invocation or field access should be done by named access words, not by numbers.
ruv
Anton writes:
postpone can be implemented like this
postpone can be implemented in any variant of the Recognizer API, with more or less code.
A difference is whether the behavior of postpone can be extended/changed without redefinition of postpone.
My point: if users need to extend the behavior of postpone without redefinition, then a special method can be specified for that. OTOH, postpone (and ]]) is a poor man's "postponing mode". An example of a more convenient tool is my c-state PoC, which provides a better tool for users, and it even supports any new user-defined special words.
I don't want to restrict the usage of rectypes/translators to state-dependent outer interpreters.
It's not an argument, since the API can provide words like compile-token, execute-token, postpone-token, having ( i*x xt.translator -- j*x ) or ( i*x rectype -- j*x ), which are state-independent and don't restrict usage in the mentioned way.
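As a rough sketch of what such state-independent words could look like on top of the table layout of the reference implementation: the three accessor definitions below are assumptions (only their names come from the comment above), and they work only for translators created by this particular translate:, whose body holds the postpone, compile and interpret xts in that order. translate:, lit, and translate-num are repeated from the reference implementation so the sketch stands alone.

```forth
\ Repeated from the reference implementation
: lit, ( n -- ) postpone literal ;
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,                           \ body: xt-post, xt-comp, xt-int
  does> state @ 2 + cells + @ execute ;

:noname ( n -- n ) ;                     \ interpret: leave n on the stack
' lit,                                   \ compile: compile n as a literal
:noname ( n -- ) lit, postpone lit, ;    \ postpone: compile code that compiles n
translate: translate-num

\ Hypothetical accessors that select an action without looking at STATE
: execute-token  ( i*x translator-xt -- j*x ) >body 2 cells + @ execute ;
: compile-token  ( i*x translator-xt -- j*x ) >body cell+     @ execute ;
: postpone-token ( i*x translator-xt -- j*x ) >body           @ execute ;

42 ' translate-num execute-token .       \ prints 42

: c42 ( -- ) 42 ['] translate-num compile-token ; immediate
: answer ( -- n ) c42 ;
answer .                                 \ prints 42
```

A system with a different internal representation would define the three accessors differently, which is why they would have to be part of the API rather than user code.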
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
- 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
- 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
- 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
- 2023-09-15 Add list of example recognizers and their names.
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows implementing fancier features as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone in this stage, use -48 throw, otherwise define a postpone action:
:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF compile, ELSE execute THEN ;
:noname ( xt flag -- ) 0< IF postpone literal postpone compile, ELSE compile, THEN ;
translate: translate-xt
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a translator.
"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.
Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE-independently, which only works on translators created by TRANSLATE: (e.g. for implementing POSTPONE), so any other way to define a translator is non-standard.
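As an illustration of how TRANSLATE: and a recognizer fit together, here is a minimal sketch of a recognizer for character literals of the form 'c'. The names rec-char and translate-char are hypothetical and not part of the proposal. notfound, lit, and translate: are repeated from the reference implementation so that the sketch stands alone, and it assumes a system where STATE is 0 when interpreting and -1 when compiling.

```forth
\ Repeated from the reference implementation
: notfound ( -- ) -13 throw ;
: lit, ( n -- ) postpone literal ;
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,                           \ body: xt-post, xt-comp, xt-int
  does> state @ 2 + cells + @ execute ;

\ Hypothetical translator for a character code (a single-cell value)
:noname ( c -- c ) ;                     \ interpret: leave c on the stack
' lit,                                   \ compile: compile c as a literal
:noname ( c -- ) lit, postpone lit, ;    \ postpone: compile code that compiles c
translate: translate-char

\ Hypothetical recognizer: accepts exactly 'c', rejects everything else
: rec-char ( addr u -- c translate-char | notfound )
  3 = IF                                 \ exactly three characters?
    dup c@ [char] ' = IF                 \ opening quote?
      dup 2 chars + c@ [char] ' = IF     \ closing quote?
        char+ c@ ['] translate-char EXIT
      THEN
    THEN
  THEN
  drop ['] notfound ;

s" 'A'" rec-char execute .               \ interpreting: prints 65
```

The recognizer itself never looks at STATE; it only decides which translator to return, and executing the translator picks the action.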
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( n*xt n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with the topmost xt on stack and proceeding towards the bottommost xt until successful.
SET-RECOGNIZER-SEQUENCE ( n*xt n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- n*xt n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as n*xt n.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token.
TRANSLATE-NUM ( n -- n | ) RECOGNIZER EXT
Translates a number.
TRANSLATE-DNUM ( d -- d | ) RECOGNIZER EXT
Translates a double number.
TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT
Translates a floating point number.
TRANSLATE-STRING ( addr u -- addr u | ) RECOGNIZER EXT
Translates a string.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you should define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Recognizer examples
REC-NT ( addr u -- nt translate-nt | notfound ) Search the locals wordlist if locals have been defined, and then the search order for a definition matching the string addr u, and provide that name token as result.
REC-NUM ( addr u -- n translate-num | d translate-dnum | notfound ) Try converting addr u into a number, and on success return either a single number n and translate-num, or a double number d and translate-dnum.
REC-FLOAT ( addr u -- r translate-float | notfound ) Try converting addr u into a floating point number, and on success return that number r and translate-float.
REC-STRING ( addr u "string"<"> -- addrs us translate-string | notfound "string"<"> ) Convert quoted strings (i.e. addr u starts with '"') in the input stream into string literals, performing the same escape handling as S\" and on success return the converted string as addrs us and translate-string.
REC-TICK ( addr u -- xt translate-num | notfound ) If addr u starts with a ` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.
REC-SCOPE ( addr u -- nt translate-nt | notfound ) Search for words in specified vocabularies (the vocabulary needs to be found in the current search order). The string addr u has the form vocabulary:name. Apart from specifying the vocabulary to be searched, REC-SCOPE is identical in effect to REC-NT.
REC-TO ( addr u -- xt n translate-to | notfound ) Handle the following syntax of TO-like operations on value-like words:
* ->value as TO value or IS value
* +>value as +TO value
* '>value as ADDR value
* @>value as ACTION-OF value
xt is the execution token of the value found, n indexes which variant of a TO-like operation is meant, and translate-to is the corresponding translator.
REC-ENV ( addr u -- addrs us translate-env | notfound ) Takes a pattern in the form of ${name} and provides the name as addrs us on the stack. The corresponding translator translate-env is responsible for looking up that name in the operating system's environment variable array.
REC-COMPLEX ( addr u -- rr ri translate-complex | notfound ) Converts a pair of floating point numbers in the form float1+float2i into a complex number on the stack, and returns translate-complex on success.
Testing
TBD
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
- 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
- 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
- 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
- 2023-09-15 Add list of example recognizers and their names.
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows more fancy stuff to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone at this stage, use -48 throw as the postpone action; otherwise define a postpone action:
:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF compile, ELSE execute THEN ;
:noname ( xt flag -- ) 0< IF postpone literal postpone compile, ELSE compile, THEN ;
translate: translate-xt
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a translator.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state, using a system-specific way to determine the current mode.
Rationale: By far the most common usage of translators is inside the outer interpreter, and this default mode of operation is invoked by EXECUTE to keep the API small. There may be other, non-standard modes of operation in which the individual component xts are accessed STATE-independently (e.g. for implementing POSTPONE). These only work on translators created by TRANSLATE:, so any other way to define a translator is non-standard.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( n*xt n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with the topmost xt on stack and proceeding towards the bottommost xt until successful.
SET-RECOGNIZER-SEQUENCE ( n*xt n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- n*xt n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as n*xt n.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token.
TRANSLATE-NUM ( n -- n | ) RECOGNIZER EXT
Translates a number.
TRANSLATE-DNUM ( d -- d | ) RECOGNIZER EXT
Translates a double number.
TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT
Translates a floating point number.
TRANSLATE-STRING ( addr u -- addr u | ) RECOGNIZER EXT
Translates a string.
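As an illustration of how a further translator could be defined with TRANSLATE:, here is a minimal sketch of TRANSLATE-DNUM. It assumes that TRANSLATE: and NOOP behave as in the reference implementation below and that 2LITERAL from the Double-Number word set is available. The helper name 2lit, is made up for this sketch and is not part of the proposal.

```forth
\ Sketch only: 2lit, is a hypothetical helper, mirroring lit, for double numbers.
: 2lit, ( d -- ) postpone 2literal ;

' noop                                   \ interpret: leave d on the stack
' 2lit,                                  \ compile: compile d as a literal
:noname ( d -- ) 2lit, postpone 2lit, ;  \ postpone: compile d, then code that compiles it
translate: translate-dnum ( d -- d | )
```

The three xts are passed in the order interpret, compile, postpone, matching the stack comment of TRANSLATE:.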
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without a base prefix. This implementation takes only interpretation and compilation state into account, and uses the STATE variable to distinguish them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
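The rationale for TRANSLATE: mentions that a system may access the component xts of a translator independently of STATE, e.g. to implement POSTPONE. A minimal sketch of what that could look like on top of this reference implementation follows. It relies on the body layout produced by "create , , ," above (postpone action first, then compile, then interpret). The names translator>postpone, translator>compile, translator>interpret and postpone-lexeme are hypothetical and not part of the proposal.

```forth
\ Body layout from "create , , ,": cell 0 = xt-postpone, 1 = xt-compile, 2 = xt-interpret
: translator>postpone  ( translator-xt -- xt ) >body @ ;
: translator>compile   ( translator-xt -- xt ) >body cell+ @ ;
: translator>interpret ( translator-xt -- xt ) >body 2 cells + @ ;

\ POSTPONE-like word: recognize the next lexeme and run its postpone action.
: postpone-lexeme ( "name" -- )
  parse-name forth-recognize
  dup ['] notfound = IF execute THEN   \ NOTFOUND has no body; executing it throws -13
  translator>postpone execute ; immediate
```

This only works for translators created by this particular TRANSLATE:, which is the reason the proposal declares any other way of defining a translator non-standard.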
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you should define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
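To illustrate how an additional recognizer plugs into such a sequence, here is a sketch of a recognizer for hexadecimal numbers written with a $ prefix. The name rec-hex is hypothetical; it reuses translate-num from above and assumes /STRING from the String word set.

```forth
: rec-hex ( addr u -- n translate-num | notfound )
  dup 2 u< IF 2drop ['] notfound EXIT THEN              \ need "$" plus at least one digit
  over c@ [char] $ <> IF 2drop ['] notfound EXIT THEN   \ no "$" prefix
  1 /string                                             \ skip the prefix
  base @ >r hex  0. 2swap >number  r> base !            \ convert in hex, restore BASE
  0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;

\ Install it together with the two existing recognizers.
' rec-hex ' rec-num ' rec-nt 3 ' default-recognize set-recognizer-sequence
```

Because FORTH-RECOGNIZE is deferred to default-recognize, a lexeme such as $ff should then be translated as the number 255, provided no word of that name exists.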
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Recognizer examples
REC-NT ( addr u -- nt translate-nt | notfound ) Search the locals wordlist if locals have been defined, and then the search order for a definition matching the string addr u, and provide that name token as result.
REC-NUM ( addr u -- n translate-num | d translate-dnum | notfound ) Try converting addr u into a number, and on success return either a single number n and translate-num, or a double number d and translate-dnum.
REC-FLOAT ( addr u -- r translate-float | notfound ) Try converting addr u into a floating point number, and on success return that number r and translate-float.
REC-STRING ( addr u "string"<"> -- addrs us translate-string | notfound "string"<"> ) Convert quoted strings (i.e. addr u starts with '"') in the input stream into string literals, performing the same escape handling as S\" and on success return the converted string as addrs us and translate-string.
REC-TICK ( addr u -- xt translate-num | notfound ) If addr u starts with a ` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.
REC-SCOPE ( addr u -- nt translate-nt | notfound ) Search for words in specified vocabularies (the vocabulary needs to be found in the current search order). The string addr u has the form vocabulary:name. Apart from specifying the vocabulary to be searched, REC-SCOPE is identical in effect to REC-NT.
REC-TO ( addr u -- xt n translate-to | notfound ) Handle the following syntax of TO-like operations on value-like words:
* ->value as TO value or IS value
* +>value as +TO value
* '>value as ADDR value
* @>value as ACTION-OF value
xt is the execution token of the value found, n indexes which variant of a TO-like operation is meant, and translate-to is the corresponding translator.
REC-ENV ( addr u -- addrs us translate-env | notfound ) Takes a pattern in the form of ${name} and provides the name as addrs us on the stack. The corresponding translator translate-env is responsible for looking up that name in the operating system's environment variable array.
REC-COMPLEX ( addr u -- rr ri translate-complex | notfound ) Converts a pair of floating point numbers in the form float1+float2i into a complex number on the stack, and returns translate-complex on success.
Testing
TBD
BerndPaysan
Things to discuss, because there are still too many variables.
ToDo:
- Rename recognizers from REC-result to RECOGNIZE-result. A solution for .RECOGNIZERS drowning the reader in recognize- could be to skip that prefix, because all recognizers are supposed to have the same prefix anyway.
- Revert the name of translators to rectypes or some similar word showing that this does describe a type?
- Add mode/state-specific access words to the translators again and decide on how they work. I prefer defer-field-like words, which execute the corresponding action right away and do not put an xt on the stack for consumption. Defer-fields could work together with IS and ACTION-OF to access the xts within (in Gforth, they do).
Answers to some questions:
A lot of thought went into making different subsets of this proposal useful on their own, and into allowing different implementation strategies. The answer to “can I do without feature X” is most likely yes. You can use the subset of the features you want. Stripping away too much results in a subset that is no longer usable.
- Opening up the whole idea to small systems is useful to gain wider use.
- FORTH-RECOGNIZE is a deferred word in the reference implementation on purpose, which allows changing it without adding more words. To allow more implementation options, you can use the setter and getter words (which are optional) if you don't want to implement it as a deferred word to swap named sequences in and out.
- The recognizer sequences do have words to get and set the sequence, so you can just work with a single sequence and set/get it if you like. The nesting capability comes from the fact that a recognizer sequence has the same stack effect as a recognizer.
- You can do without both, because recognizer sequences can be written as colon definitions by hand.
- Named sequences are useful, especially when you swap in recognizer sequences for applications that do something completely different from the Forth recognizer sequence. If you do not want to support named sequences, you can still provide the one single named sequence FORTH-RECOGNIZE, and allow SET-RECOGNIZER-SEQUENCE and GET-RECOGNIZER-SEQUENCE to operate just on that. That is also an option where recognizers are useful without FORTH-RECOGNIZE being deferred and without RECOGNIZER-SEQUENCE:.
- The NOTFOUND return for failure is there so that you can always EXECUTE the result of FORTH-RECOGNIZE and don't have to check for errors there.
Tough question: The string recognizer has a side effect, which is not good. Moving that side effect to the translator causes other problems, because TRANSLATE-STRING no longer has the corresponding string on the stack, but needs to parse it later. Actually, parsing should happen in PARSE-NAME. It still seems to be a hack that doesn't have a perfect solution.
ruv
Rename Recognizers from REC-result to RECOGNIZE-result
In general, an abbreviation or acronym may be acceptable to me. But in this case I prefer RECOGNIZE- rather than REC-. The main disadvantage of rec is that it has misleading associations. And the main advantage of recognize is that it's a whole English word that is very appropriate for our case.
The part referred to as "result" should not be a result (of recognizing), but the expected type of the input lexeme. Have a look at your examples: REC-NUM and REC-TICK produce the same result type translate-num, but they accept different types of input lexemes, and these types are identified by the NUM and TICK symbols respectively.
Thus, the naming form for recognizers can be expressed as RECOGNIZE-{lexeme-type-symbol}.
Revert the name of translators to rectypes or some similar word showing that this does describe a type?
It does describe a type of what? It describes a type of a token i*x, which is a result of recognizing. Actually, a token translator identifies the type of a token i*x, which is a result of recognizing. Then, a token translator is a token type at the same time.
If we want to reflect this idea, we can use the acronym tt, which stands for both: token translator and token type. Then, token translators can be named according to the form TT-{token-type-symbol}. It looks elegant to me.
The names of translators are used for two purposes: to call a translator (for example, when we define a new translator via existing translators), and to obtain the xt of a translator (which is at the same time an identifier for a token type) in order to analyze a result of recognizing. The prefix tt- looks good in both cases.
ruv
An example of using translators for two different purposes:
\ use "tt-lit" and "tt-2lit" just to call these token translators:
: tt-3lit ( 3*x -- 3*x | )
>r tt-2lit r> tt-lit
;
: recognize-forth-lexeme ( sd -- i*x tt ) forth-recognizer execute ;
\ use "tt-xt" to analyze a token type:
: recognize-tick ( sd -- xt tt.xt | 0 )
"'" match-head 0= if 2drop 0 exit then ( sd2 ) \ the input lexeme without the leading tick
['] recognize-forth-lexeme execute-balance2 ( i*x tt|0 n.data-stack n.float-stack )
2>r dup ['] tt-xt = if 2rdrop exit then drop 2r> fndrop ndrop 0
;
In this implementation of recognize-tick (not tested), the phrase 'foo::bar::baz will work correctly and return the xt of the word baz in the wordlist bar in the wordlist foo, when recognize-pqname for the syntax "::" (example) is a part of forth-recognizer.
To implement this, we make a nested call of the Forth recognizer for another lexeme and then analyze the returned type. If the returned type is not appropriate, we drop the token (from the data stack, and from the floating-point stack, if any). So we need to be sure that calling recognize-forth-lexeme never causes any side effect (other than on the stacks), even when recognizing succeeds.
NB: when recognize-tick is a part of the current forth-recognizer, executing recognize-forth-lexeme on some inputs will produce an indirectly recursive call of recognize-forth-lexeme (as intended).
ruv
Tough question: The string recognizer has a side effect, which is not good. Moving that side effect to the translator is causing other problems, because TRANSLATE-STRING no longer has the corresponding string on the stack, but needs parsing it later.
It's perfectly permissible for a translator to parse the input buffer and/or read the input stream. Some token translators will even make nested calls of the Forth text interpreter and can throw exceptions.
The problem that a part of the string can be in the input buffer (or even in the input stream) is solved by introducing two translators for strings: one accepts the full string from the stack (e.g. tt-slit), and another (e.g. tt-slit-parsing) accepts the starting part from the stack and the tail from the input buffer (or input stream). The string recognizer returns one or the other depending on whether a lexeme is a complete string or only the start of a string.
I published a reference implementation in 2019, and now updated it for the current proposal.
A string recognizer can be as follows:
: quot ( -- sd.quot ) s\" \"" ;
: recognize-string ( sd.lexeme -- sd tt.slit|tt.slit-parsing | 0 )
quot match-head 0= if 2drop 0 exit then quot match-tail if ['] tt-slit exit then
2dup quot contains if 2drop 0 exit then \ fail if '"' is found in the middle of the string
['] tt-slit-parsing
;
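The helper words match-head, match-tail and contains are not defined in this thread. Judging only from how match-head is used above, one possible definition (an assumption for illustration, not ruv's actual code) could look like this; it uses PICK, COMPARE and /STRING from the standard word sets.

```forth
\ Assumed semantics: if the string c-addr1 u1 starts with c-addr2 u2,
\ strip that head and return true; otherwise return the string unchanged and false.
: match-head ( c-addr1 u1 c-addr2 u2 -- c-addr3 u3 true | c-addr1 u1 false )
  2 pick over u< IF 2drop false EXIT THEN   \ lexeme is shorter than the head
  dup >r                                    \ remember the head length u2
  3 pick r@ compare                         \ compare the head with the first u2 chars
  IF r> drop false EXIT THEN                \ mismatch: leave the lexeme untouched
  r> /string true ;                         \ match: skip the head
```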
BerndPaysan
The code which I have simply looks like this:
['] translate-string of json-string! endof
Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.
Thinking a bit more about that, I found:
- The thing you want to nest is the translator for names
- As I said, names should be first, numbers second and the rest third
- We have nestable recognizer sequences
So one solution would be to put all recognizers that return nts+translate-nt (or variants of that, e.g. locals have a variant of translate-nt that differs for postpone) in one recognizer stack, which has a name, and can be called without calling the entire recognizer stack. These recognizers now have a predictable effect, and no side effect. Since you can't tick locals, you still have to check for translate-nt, but that's ok. You don't have to go through all the weird other recognizers.
In Gforth, .recognizers can now handle and display nested recognizers, and if you split this up like that, it would output:
.recognizers ~names ( ~nt ( Forth Forth Root ) ~scope ) ~numbers ( ~num ~float ) ~others ( ~string ~to ~dtick ~tick ~body ~complex ~env ~meta )
The ~ is there to abbreviate recognize- (or rec- now).
This also makes it easier to add recognizers where they belong, e.g. when you add the scope recognizer, you just push it to the end of the names recognizer stack. If you add the floating point recognizer, the complex recognizer (both are numbers), or the hex floating point recognizer for exact notation of floating point constants, you just push them to the back of the numbers recognizer stack, and they get ahead of the others. I like this solution.
The other solution is what Gforth does: There's a ?REC-NT which does the nesting, the checking for translate-nt, and the cleaning up of the side effects (stacks and >IN). There is the possibility to make this more generic, e.g. create a word TRY-RECOGNIZE which gets an xt and applies it to the result of recognizing; if that returns false, everything is cleaned up and false is returned, otherwise whatever that xt left (including the flag) is returned.
The cleaning up is already cumbersome, because a variable number of values can be returned on both the data and the floating point stack, and when in addition to that >IN can also change, it's just a little bit more hassle.
ruv
One correction. I wrote:
If we want to reflect this idea, we can use the acronym tt, which stands for both: token translator and token type.
It should be read as:
If we want to reflect this idea, we can use the acronym tt, which stands for both: "translate token" (verb) and "token type" (noun).
Data type symbol
To specify formal requirements, we have to introduce a new data type for token translators, which is a subtype of xt. And the abbreviation tt is a good candidate for this data type symbol.
If we have the data type tt => xt|0, and the symbol sd for the string data type, the naming convention along with the stack diagram for a recognizer can be expressed as:
RECOGNIZE-{lexeme-type-symbol} ( sd.lexeme -- i*x tt ) ( F: -- j*r )
ruv
@BerndPaysan writes:
Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.
This would be a valid argument if it were possible to return something useful from a recognizer in all use cases but single-line string literals. But it's impossible.
For example, a recognizer for multi-line string literals cannot parse the full string literal without refilling the source (see my PoC implementation). Should we also restore the input source state to isolate side effects of recognizers?
And it still isn't enough. A recognizer for curly-based markup like foo{ any forth code bar{ nested code }bar ... }foo cannot return something useful, since a useful thing in this case is a created definition or just a side effect of appending some semantics to the current definition. Should we also restore the state of the dictionary?
I think it's obvious: isolating all possible side effects of recognizers is not fruitful.
Yes, some recognizers return objects that are not useful by themselves, but they still return information about what a given lexeme means, and that's an acceptable price for the absence of side effects in all recognizers.
Also, we separate concerns into things that do have side effects (token translators) and things that don't have side effects (recognizers). It's a very useful separation.
ruv
@BerndPaysan writes:
The code which I have simply looks like this:
['] translate-string of json-string! endof
A straightforward solution is to handle each token type of string literals separately. Probably, I would write it as follows:
'tt-slit of json-string! endof
'tt-slit-parsing of parse-slit-end json-string! endof
'tt-slit-ml of parse-slit-ml json-string! endof
(I would use a recognizer for a leading tick, and naming of translators in the form tt-{token-type-symbol})
Or I would factor a helper word as follows:
: ?prepare-tt-slit ( i*x tt -- i*x tt | sd.transient tt.slit )
case
'tt-slit of 'tt-slit endof
'tt-slit-parsing of parse-slit-end 'tt-slit endof
'tt-slit-ml of parse-slit-ml 'tt-slit endof
endcase
;
: eval-json ( .. tag -- )
?prepare-tt-slit case
...
'tt-slit of json-string! endof
...
endcase
;
ruv
Multiple entry points for the Forth recognizer
@BerndPaysan writes:
This also makes it easier to add recognizers where they belong, e.g. when you add the scope recognizer, you just push it to the end of the names recognizer stack. If you add the floating point recognizer, the complex recognizer (both are numbers), or the hex floating point recognizer for exact notation of floating point constants, you just push them to the back of the numbers recognizer stack, and they get ahead of the others. I like this solution.
Yes, I also considered such a solution. It's a convenient way to implement the default Forth recognizer.
But requiring the Forth recognizer to always conform to this particular structure of recognizer sequences, and even always be the same instance of this structure, is too restrictive.
And otherwise you don't know the id of the actual names recognizer sequence (and don't even know whether such a sequence exists), and so you cannot check a lexeme against only this sequence (I mean, in an implementation of recognize-tick).
Filter recognizer results
Bernd, your word try-recognize is a good factor to filter results, regardless of side effects (beyond stacks). With recognizers that have no side effects, it can also be implemented in a portable way.
If this word filters for a single token type, it's better to pass a corresponding tt directly (instead of xt.filter).
If this word allows filtering for multiple token types (I assume this variant), it should not drop tt from the stack.
Also, to be more useful, this word should not be bound to the current Forth recognizer only. Then, this word could be named apply-recognizer-filter ( sd.lexeme xt.recognizer xt.filter -- i*x tt | 0 ).
A usage example:
: recognize-forth-name ( sd.lexeme -- nt tt.nt | 0 )
forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter
;
: find-forth-name ( sd.lexeme -- nt | 0 )
forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter if exit then 0
;
: find-forth-name? ( sd.lexeme -- nt true | false )
forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter 0<>
;
: recognize-tick ( sd.lexeme -- xt tt.xt | 0 )
"'" match-head 0= if 2drop 0 exit then ( sd2 ) \ the input lexeme without the leading tick
forth-recognizer [: dup 'tt-xt = ;] apply-recognizer-filter
;
The cleaning up is already cumbersome, because a variable number of values can be returned on both data and floating point stack, and when in addition to that also >IN can change, it's just a little bit more hassle.
Yes, but, as I show, >in
is not enough. Also, it's better to avoid such special cases in general.
ruv
FORTH-RECOGNIZER ( -- xt )
RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
There is no way for a program to check whether it can apply TO
to FORTH-RECOGNIZER
, or FORTH-RECOGNIZE
, or RECOGNIZE-FORTH-LEXEME
, etc.
Thus, TO
cannot be optional. Nor can it be mandatory, since not every system implements these words as values. Thus, TO
cannot be a part of the API at all, neither in RECOGNIZER nor in RECOGNIZER EXT.
Then the getter and setter should be a mandatory part of the API.
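As a minimal sketch of what such a getter and setter could look like on a system where the recognizer happens to be a deferred word (an assumption, not something the proposal fixes), the two words below hide the storage mechanism. The name set-forth-recognizer is taken from the discussion; rec-nothing is a hypothetical placeholder recognizer used only for the demonstration.

```forth
\ Sketch only: assumes the system recognizer is implemented as a deferred word.
defer forth-recognize ( sd.lexeme -- i*x tt | 0 )

\ Getter: return the xt currently assigned to the recognizer.
: forth-recognizer ( -- xt )  action-of forth-recognize ;

\ Setter: assign a new recognizer xt.
: set-forth-recognizer ( xt -- )  is forth-recognize ;

\ Hypothetical placeholder recognizer that never recognizes anything.
: rec-nothing ( sd.lexeme -- 0 )  2drop 0 ;

' rec-nothing set-forth-recognizer
```

A program that only uses forth-recognizer and set-forth-recognizer does not need to know whether the system stores the recognizer in a deferred word, a value, or something else.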
ruv
In continuation to the message:
Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.
This would be a valid argument if it were possible to return something useful from a recognizer in all use cases.
Another example of a token that is not useful by itself is the result of the mentioned recognizer REC-TO
, which recognizes a syntax like ->foo
.
It's too restrictive to require this token be ( xt n tt )
, since in some systems it can be just ( xt.set-value tt )
, and in others ( addr.data-field xt.store tt )
.
This means that this token is not useful to a program at all (apart from translation).
GeraldWodni
The committee thanks the authors for all the work. Here is the timetable:
- Everybody interested in this proposal: please submit your comments by end of October.
- Bernd (main author): please work this into a new version by the end of the year (2024).
- The committee will have a special interim meeting for this very proposal in February (final date will be announced in mattermost)
BerndPaysan
Concerning the setters and getters: I would prefer to make it mandatory that FORTH-RECOGNIZE
actually is a deferred word, and drop the additional getters and setters completely. DEFER
, IS
, and ACTION-OF
are all CORE EXT; so if you implement the recognizers, you have a dependency on those. The previous proposals had VALUE
and TO
as interface, which are also CORE EXT.
Gforth could support IS
and ACTION-OF
on recognizer sequences, too (i.e. assign n elements in order), through its polymorphous approach at all those words for value-style words (TO
, +TO
, ADDR
, IS
, ACTION-OF
all can do different things on different classes of values), but I guess that would be too much.
Can those setters and getters be optional in case you don't want to support DEFER
, and how can a program be written to work in both cases? If you have TOOLS EXT available, you can use
[DEFINED] is [IF]
is forth-recognize
[ELSE]
[DEFINED] to [DEFINED] forth-recognizer and [IF]
to forth-recognizer
[ELSE]
set-forth-recognizer
[THEN]
[THEN]
Yes, this is ugly and shows that having different options is not a good idea.
For the reworked proposal, I will need to restructure the proposal in a way that optional parts I rather want to remove are outlined as such, so that the final rewrite is easy.
ruv
Deferred words in API considered harmful
make it mandatory that
FORTH-RECOGNIZE
actually is a deferred word
As we have discussed, the main problem with a deferred word is that it can't be redefined by wrappers that have additional actions when setting or getting the value. In this respect, such a word in an API is as bad as an address-flavoured variable (like BASE
).
There is also a recent discussion in comp.lang.forth (link) under subjects "value-flavoured approach" and "value-flavoured structures".
Special data object on failure considered harmful
A question is what to return on failure (unsuccess): a special data object (xt of notfound
) or a common data object 0
(zero).
Below is a copy of my rationale from 2023, with some rewording.
There are two strong arguments against a special data object:
- consistency with other similar words;
- impact on the overall lexical size of programs.
Consistency
Many standard words return some data object on success, or 0
(zero) on unsuccess/failure. This is possible because this data object cannot be 0
.
For example:
name>interpret ( nt -- xt | 0 )
find-name ( sd.name -- nt | 0 )
find-name-in ( sd.name wid -- nt | 0 )
find ( c-addr -- xt n | c-addr 0 )
search-wordlist ( sd.name wid -- xt n | 0 )
source-id ( -- fileid | -1 | 0 )
— not a fail, but also an example when zero was chosen instead of a special object.
Also, it is a common approach in practice. It allows common higher-order functions to operate on the common failure result 0
.
Why should not recognizers follow this practice? Why should they return a special id on failure rather than zero?
Lexical code size
Returning notfound
on failure makes the code shorter (in terms of lexemes) in some places. But the point is that it makes code longer in more places.
I checked the source codes in Gforth (as of 2023-09-17), which include both the implementation and usage of a Recognizer API. In its code:
- ['] notfound with = or <> is used 10 times, and 32 times without such a check;
- forth-recognize execute is used 3 times.
If we use 0 (zero) instead of the notfound xt, then:
- ['] notfound <> is removed 5 times, which eliminates 15 lexemes;
- ['] notfound = is replaced with 0= 5 times, which eliminates 10 lexemes;
- ['] notfound is replaced with 0 32 times, which eliminates 32 lexemes;
- the definition for notfound is removed, and a definition for ?found is added: : ?found ( x.some\0 -- x.some | 0 -- never ) dup 0= -13 and throw ; which adds not more than +3 lexemes;
- forth-recognize execute is replaced with forth-recognize ?found execute 3 times, which adds +3 lexemes;
- the word ?found can also be used after find, search-wordlist, find-name, find-name-in, when the user needs to execute their result at once and a failure should produce an exception.
Thus, replacing notfound
with zero reduces the overall lexical code size in Gforth by more than 51 lexemes, which is more than 0.4KiB in absolute size (as of 2023-09-17).
So why should we prefer an approach that increases the overall lexical size of programs?
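To illustrate the last point of the list, here is a small sketch of ?found used after find-name. The definition of ?found is the one given above; execute-name is a hypothetical helper introduced only for this example, and it assumes a system that provides find-name ( sd.name -- nt | 0 ).

```forth
\ Throw -13 (undefined word) when the failure result 0 is on top.
: ?found ( x\0 -- x | 0 -- never )  dup 0= -13 and throw ;

\ Hypothetical helper: look up a name and perform its interpretation
\ semantics at once, or throw if the name is not found.
: execute-name ( i*x sd.name -- j*x )
  find-name ?found  name>interpret execute
;

5 s" dup" execute-name + .
```

The same ?found works unchanged after any word that returns 0 on failure, which is the consistency argument made above.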
AntonErtl
About the proposal text
The "Problem" section does not describe a problem of Forth-2012 that the proposal wants to solve, but considers a problem with some other recognizer proposal. Similarly, the "Solution" section refers to some other recognizer proposal. This makes these sections useless for readers who have not first read up on the other proposal, which is not even linked here. Parts of the "Solution" section might be useful in another section on transitioning from the earlier proposal.
Instead, the "Problem" and "Solution" sections should describe what benefits this proposal adds to the standard, and how. A possible "Discussion" section and its subsections should describe the benefits of the present approach over possible alternative approaches (if that's too detailed, lazy system implementors will complain about the length of the proposal, but some complaints should just be ignored).
"Typical use" should of course be presented.
State-dependence
The proposal in its present form is unacceptable to me because it
defines a defining word TRANSLATE:
for state-dependent words, and
expects recognizers to produce the xt of state-dependent words. This
makes the translators hard to use anywhere except in INTERPRET
; the
proposed-for-standard interface is even hard (actually impossible with
standard means) to use in POSTPONE
, which is an intended user of
translators, as the proposal admits itself:
POSTPONE can do that without a standardized way
Another problem with the state-dependent translators is that they lead to either handwaving specifications of what they do, as evidenced in XY.3.1:
TRANSLATE-THING ( jx ix -- k*x )
A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.
in the non-specification of what translator-xt does in
FORTH-RECOGNIZE
and the handwaving specification of "name:" in
TRANSLATE
:
"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.
and the nonspecification of what TRANSLATE-NT
, TRANSLATE-NUM
,
TRANSLATE-DNUM
, TRANSLATE-FLOAT
, and TRANSLATE-STRING
do.
Or if you specify exactly what happens, it leads to lengthy texts that
explain the state-dependence, and the three different cases. And you
cannot even specify when xt-post is performed, because there is no
"postpone state" in the standard. On the contrary the current
document specifies that STATE
is either 0 (interpretation state) or
non-zero (compilation state), without any values left for a postpone
state, and specifies only words for getting into interpretation state
and compilation state, not postpone state.
If you really believe that the state-dependent approach is a good idea, please specify all these words exactly; the editor won't do it for you.
Opaque solution
If there is no need to make POSTPONE implementable in a standardized way, there is no need to make INTERPRET (which is not even standardized) implementable in a standardized way, either, and the translators can become a completely opaque thing that the standard does not document. In that case there is also no need for the translators to actually be executable. The recognizer could return an opaque translation token, and standard programs can only use that for implementing recognizers, but not for implementing text interpreters, POSTPONE, or anything else.
Transparent solutions
Alternatively, we might heed "Don't bury your tools!" and have a more useful interface for translators, like what we have seen in earlier drafts and other recognizer proposals.
POSTPONE
If the idea of the proposal is that xt-post is actually used by POSTPONE, the proposal should specify the change to POSTPONE.
Standardize recognizers
I expect that more people will want to compose existing recognizers into recognizer sequences than to define new recognizers, but they usually need to know about existing recognizers in order to do that. Therefore the proposal (or an accompanying proposal) should not just propose standard translators, but also standard recognizers.
ruv
@AntonErtl writes:
The proposal in its present form is unacceptable to me because it defines a defining word
TRANSLATE:
for state-dependent words, and expects recognizers to produce the xt of state-dependent words.
I do not like TRANSLATE:
either, but for a different reason. Sometimes it is very convenient to define a translator as a quotation (right inside the recognizer), and if you are forced to define a translator only with TRANSLATE:
, you cannot define it as a quotation.
This makes the translators hard to use anywhere except in
INTERPRET
;
Could you provide some examples, please? It seems, this is not harder than performing the observable interpretation semantics using the result of name>interpret
.
BerndPaysan
Concerning explicit access methods to xt-int/xt-comp/xt-post, I can offer the following compromise, as a result of observations made:
It turns out that you cannot access xt-int and xt-comp by setting STATE
, executing the translator, and then reverting STATE
to the value before, because words can change STATE
as part of their interpretation or compilation semantics, and in that case, the state change is a desired result of performing interpretation or compilation semantics.
However, it turns out that you can access xt-post that way, because the only word that possibly changes that state is [[
, and that token is a) not visible at all to POSTPONE
, and b) changes the state back to compilation state, the state POSTPONE
was in anyhow.
So if your system allows full explicit access to all three possible states, all translators have to be defined by 'TRANSLATE:', and I can offer you three access methods. If you only want to implement POSTPONE
, the following definition actually works:
: postpone ( "string" -- )
parse-name forth-recognize ?found
state @ >r -2 state ! execute r> state ! ; immediate
Further observations:
Gforth has >INTERPRET
and >COMPILE
, and doesn't use them, only >POSTPONE
is used. In exactly one place, in POSTPONE
. All other invocations are through EXECUTE
or only taking the data. The rest is implementation, including the extension towards more of those access methods for more, user-defined states. The question is whether you need to standardize a tool that has no use case, even if you don't bury it.
A possible way to deal with this is to move this out to a separate proposal.
What has been quite useful is the EXECUTE
interface for user-written interpreters, because these are interpret-only, and don't need the complication of state-dependent translators at all.
BerndPaysan
Sleeping on it added a few ideas:
The invocation through changing STATE
and restoring it works (in general) for translators that will definitely not change STATE
as part of their own operation, e.g. translators for literals. It also works (as a special case) for POSTPONE
, so a standard implementation of POSTPONE
using that method is possible. The postpone mode itself, which needs to change STATE
at [[
relies on the dispatch through STATE
without setting and restoring STATE
around the invocation, so it also works.
The question here is not if that implementation is a quality implementation, but whether it's not so bad that it is another bag full of inconsistencies. IMHO, TRANSLATE-NT
will have demonstrable inconsistencies when not using the clean TRANSLATE:
interface, but combined literal translators won't. For the cleaner interface outside of POSTPONE
itself (which is special case enough to not require the cleaner interface), we have to demonstrate that there is an actual use case. So far, we don't have one.
Both POSTPONE
with the additional functionality and the postpone mode ]]
… [[
will become part of the proposal.
ruv
@BerndPaysan writes:
It turns out that you can not access xt-int and xt-comp by setting STATE, executing the translator, and then reverting STATE to the value before, because words can change STATE as part of their interpretation or compilation semantics, and in that case, the state change is a desired result of performing interpretation or compilation semantics.
This is wrong. Yes, the state change can be a desired result of interpretation or compilation semantics, but this does not prevent us from performing the interpretation or compilation semantics regardless of the initial value of STATE
, as I have shown many times.
We can use the following helpers for that.
\ Useful factors
: compilation ( -- flag ) state @ 0<> ;
: enter-compilation ( comp: false -- true | comp: true -- true ) ] ;
: leave-compilation ( comp: true -- false | comp: false -- false ) postpone [ ;
\ For the execution semantics identified by xt,
\ perform the part that can be observed in interpreted state.
: execute-interpreting ( i*x xt -- j*x )
compilation 0= if execute exit then
leave-compilation execute enter-compilation
;
\ For the execution semantics identified by xt,
\ perform the part that can be observed in compilation state.
: execute-compiling ( i*x xt -- j*x )
compilation if execute exit then
enter-compilation execute leave-compilation
;
If we have a result of recognizing with the xt of a translator at the top (i.e., a fully qualified token), and we want to perform the corresponding interpretation semantics regardless of the current value of STATE
, we should execute this xt with execute-interpreting
. If we want to perform the corresponding compilation semantics regardless of the current value of STATE
, we should execute it with execute-compiling
. If we want to perform the semantics according to STATE
, we should just execute this xt with execute
.
The key point in the implementation of execute-interpreting
and execute-compiling
is that we do not save/restore STATE
if it matches the semantics we want to perform — and if changing STATE
is part of the semantics, STATE
will be changed. On the other hand, if STATE
does not match the semantics we want to perform, we change STATE
and then restore it — if changing STATE
is part of the semantics, then it will change STATE
to the same value that was saved and that we restore afterwards. Thus, the resulting STATE
will be as expected!
NB: execute-interpreting
and execute-compiling
are also required if we want to perform the interpretation semantics or compilation semantics from an nt, regardless of the current value of STATE
. Moreover, these words are required even in the old approach for Recognizer API, which provides the words RECTYPE>INT
and RECTYPE>COMP
— because these words have the same flaw for state-dependent words as NAME>INTERPRET
and NAME>COMPILE
.
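The following sketch shows the helpers in use. The helper definitions are repeated from above so that the block loads on its own; tt-lit is a hypothetical state-dependent literal translator and forty-two a hypothetical test word, both introduced only for this illustration.

```forth
\ Helpers repeated from above.
: compilation ( -- flag )  state @ 0<> ;
: enter-compilation ( -- )  ] ;
: leave-compilation ( -- )  postpone [ ;
: execute-interpreting ( i*x xt -- j*x )
  compilation 0= if execute exit then
  leave-compilation execute enter-compilation ;
: execute-compiling ( i*x xt -- j*x )
  compilation if execute exit then
  enter-compilation execute leave-compilation ;

\ Hypothetical state-dependent translator for a single-cell literal.
: tt-lit ( x -- x | )  state @ if postpone literal then ;

\ Between [ and ] the system is in interpretation state, yet
\ execute-compiling makes tt-lit compile the literal into forty-two.
: forty-two ( -- 42 )  [ 42 ' tt-lit execute-compiling ] ;

forty-two .
```

Executing tt-lit with plain execute at the same place would leave 42 on the stack at compile time instead of compiling it, because STATE is zero there.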
ruv
@AntonErtl writes about token translators:
if you specify exactly what happens, it leads to lengthy texts that explain the state-dependence, and the three different cases. And you cannot even specify when xt-post is performed, because there is no "postpone state" in the standard. On the contrary the current document specifies that STATE is either 0 (interpretation state) or non-zero (compilation state), without any values left for a postpone state, and specifies only words for getting into interpretation state and compilation state, not postpone state.
This is reasonable. And we also discussed in the Recognizer chat group that the standard does not imply such a state as postponing (for the Forth text interpreter).
In my opinion, these problems can be avoided.
We should specify "to translate a token" and "token translator" in the common sections of term definitions, data types and usage requirements. Then, we do not need to repeat that for every token translator. It will be enough to specify that a word is a token translator, and the data type of the token (that it translates).
We can have a word like
postpone-token ( qt -- )
that appends the compilation semantics of a lexeme, which was recognized as qt, to the current definition. (qt is a qualified token, which is a pair of an unqualified token and token translator ( uq tt ))
So, any additional state, if any, is encapsulated into postpone-token
. The standard should not specify it.
Thus, postpone
can be defined like this (in my parlance):
: postpone ( "name" -- )
parse-lexeme perceive ?found postpone-token
;
How postpone-token
finds/performs the postponing action from tt — it's an internal problem of implementation. The word postpone-token
should throw the exception -32 "invalid name argument" if a postponing action is not associated with tt.
We need to provide a way to associate a postponing action (an xt) with a tt, or to create a new tt from an xt and tt. The postponing action should be optional. The user needs to provide a postponing action only if they want to make postpone
applicable to the corresponding lexemes.
For example, we can have an optional word postponable ( tt1 xt.postponing -- tt2 )
. Probably, this word shall return the same tt2 for the same input pair ( tt1 xt.postponing )
.
This word is optional, because it can be implemented along with postpone-token
in a standard program, and postpone
can be redefined to use them.
ruv
@AntonErtl writes:
In that case there is also no need for the translators to actually be executable. The recognizer could return an opaque translation token,
I researched this approach.
In general, a recognizer returns a qualified token (qt) on success, where a qualified token is a pair of an unqualified token (ut) and a token descriptor (td).
Data type relations:
- unqualified token:
ut => ( S: i*x F: k*r )
- token descriptor:
td => x\0
- qualified token:
qt => ( ut td )
It is always possible to define a word translate-qtoken ( any qt -- any )
, which translates a qualified token (i.e., performs the interpretation or compilation semantics for the corresponding recognized lexeme depending on STATE). And as practice shows, it is very useful and in demand.
Additionally, in Forth, it is always technically possible to make the token descriptor also a token translator (that is a subtype of the execution token), without any loss (see an example).
- token translator:
tt => xt ; td = tt
So, instead of using a separate word translate-qtoken
, we can use the word execute
. And the Forth text interpreter simply executes the token translator (instead of applying translate-qtoken
to qt). Note that regardless of whether the token descriptor is a subtype of the execution token, the token descriptor is opaque for the Forth text interpreter. The only difference is whether translate-qtoken
or execute
is used by the Forth text interpreter.
The big advantage of token translators is that they can be defined inline as quotations, and they can be used to define other token translators. This simplifies programs and reduces the lexical size of programs.
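A minimal sketch of a token translator defined inline as a quotation, assuming a system that supports [: ... ;] quotations. The recognizer recognize-char is hypothetical and only illustrative (standard systems already accept the 'a' syntax as a number); it follows the convention used in this discussion of returning 0 on failure.

```forth
\ Hypothetical recognizer for lexemes of the form 'c'.
: recognize-char ( sd.lexeme -- c tt | 0 )
  dup 3 <> if 2drop 0 exit then                        \ must be 3 chars long
  over c@ [char] ' <> if 2drop 0 exit then             \ leading tick
  over 2 chars + c@ [char] ' <> if 2drop 0 exit then   \ trailing tick
  drop char+ c@                                        ( c )
  \ The token translator, defined right here as a quotation:
  [: ( c -- c | ) state @ if postpone literal then ;]
;

s" 'a'" recognize-char execute .
```

No separate named translator has to be defined; the quotation's xt serves both as the token descriptor and as the translator.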
Also, token translators allow us to define dual-semantics words more simply.
For example, this is a definition for [']
, which has the expected interpretation semantics:
: ['] ( -- xt | ) ' tt-xt ; immediate
See also in my gist the word missing(
, which has the expected interpretation and compilation semantics. Without token translators such words are more difficult to implement.