Proposal: minimalistic core API for recognizers
This page is dedicated to discussing this specific proposal
BerndPaysan [160] minimalistic core API for recognizers (Proposal, 2020-09-06 09:40:07)
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to define a very minimalistic API for a core recognizer and leaves fancier features to extensions. The problem it addresses is the same as in the original recognizer proposal, so this is not a full proposal; it only sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned data type id is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- rectype )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] rectype-null THEN ;
then be told that this is not the right way, even though it looks like it is working.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes a string and returns a rectype+additional data on the stack (no additional data for RECTYPE-NULL):
REC-SOMETYPE ( addr len -- i*x RECTYPE-SOMETYPE | RECTYPE-NULL )
XY.3 Additional usage requirements
XY.3.1 Data type id
rectype: subtype of xt, and executes with the following stack effect:
RECTYPE-SOMETYPE ( i*x state -- j*x )
state is:
- 0 for interpretation
- -1 for compilation
- -2 for POSTPONE
i*x is the additional information provided by the recognizer.
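A minimal sketch of a rectype under this convention, assuming the state numbering listed above. The names `rectype-demo-num`, `seven` and `t` are hypothetical and only serve the illustration; `lit,` is the helper from the reference implementation. Handling of an unknown state (discarding n) is an assumption of this sketch, not part of the proposal.

```forth
: lit, ( n -- ) postpone literal ;

\ Hypothetical rectype for single-cell numbers; the state is passed explicitly.
: rectype-demo-num ( n state -- n | )
  case
     0 of endof                      \ interpretation: leave n on the stack
    -1 of lit, endof                 \ compilation: compile n as a literal
    -2 of lit, postpone lit, endof   \ POSTPONE: compile code that compiles n
    nip                              \ unknown state: discard n (assumption)
  endcase ;

7 0 rectype-demo-num .               \ prints 7

\ The compilation action has to run while a definition is being compiled,
\ so it is wrapped in an immediate word here.
: seven ( -- ) 7 -1 rectype-demo-num ; immediate
: t ( -- n ) seven ;                 \ the literal 7 is compiled into t
t .                                  \ prints 7
```

The same rectype serves all three modes; the caller decides which one by the state value it passes.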
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZER ( addr len -- i*x rectype | RECTYPE-NULL ) RECOGNIZER
This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.
RECTYPE-NULL ( state -- ) RECOGNIZER
Performs -13 THROW if the exception wordset is available.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:
Defer forth-recognizer ( addr u -- i*x rectype / rectype-null )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer state @ swap execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: rectype-null ( state -- ) -13 throw ;
: rectype-nt ( nt state -- )
case
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: rectype-num ( n state -- )
case
-1 of lit, endof
-2 of lit, postpone lit, endof
endcase ;
: rec-nt ( addr u -- nt rectype-nt / rectype-null )
forth-wordlist find-name-in dup IF ['] rectype-nt ELSE drop ['] rectype-null THEN ;
: rec-num ( addr u -- n rectype-num / rectype-null )
0. 2swap >number 0= IF 2drop ['] rectype-num ELSE 2drop drop ['] rectype-null THEN ;
: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
2>r
2r@ rec-nt dup ['] rectype-null <> IF EXIT THEN drop
2r@ rec-num dup ['] rectype-null <> IF EXIT THEN drop
2r> 2drop ['] rectype-null ;
' minimal-recognizer is forth-recognizer
Testing
JennyBrien
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
I don't think so. It doesn't make much difference in application, because you (almost always?) need to consume the rec-type immediately to use whatever else might be on the stack(s). If you already know what you've got but, for example, can't remember the words to POSTPONE it, you could with an active RECTYPE do something like:
-2 RECTYPE-X
But mostly you'll have the RECTYPE sitting passively on the stack as a return for a recognizer, and I don't see a great deal of difference between:
: postponed -2 swap execute ;
and
: postponed @ execute ;
Passive rectypes are easier to use (no need to remember when to tick them) and easier to code (no need to check for a bogus mode on the stack).
Compare:
: rectype-nt ( nt state -- )
case
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
with:
: rectype: create , , , ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ; rectype: rectype-nt
BerndPaysan
One possible thing is to have an automatic postpone for literals.
: rectype-lit: ( compile-xt "name" -- )
create ,
does> @ swap
case
0 of drop endof
-1 of execute endof
-2 of dup >r execute r> compile, endof
endcase ;
' lit, rectype-lit: rectype-num
' 2lit, rectype-lit: rectype-dnum
' flit, rectype-lit: rectype-float
' slit, rectype-lit: rectype-string
This works with this method, but not with the previous way.
BerndPaysan
Furthermore, obviously anyone sane who doesn't want to be 100% minimal would instantly define
: rectype: ( xt-int xt-comp xt-post "name" -- )
create , , , does> swap 2 + cells + @ execute ;
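As a worked illustration of the dispatcher above: the three xts are stored in reverse order, so a state of -2, -1 or 0 selects cell 0, 1 or 2. The three `:noname` actions and the name `rectype-demo` are hypothetical stand-ins that only report which action was selected.

```forth
: rectype: ( xt-int xt-comp xt-post "name" -- )
  create , , ,         \ cell 0 = xt-post, cell 1 = xt-comp, cell 2 = xt-int
  does> ( i*x state addr -- j*x )
    swap 2 + cells + @ execute ;

\ Hypothetical actions that only return a marker value.
:noname ( -- n ) 100 ;   \ interpretation: state  0 selects cell 2
:noname ( -- n ) 200 ;   \ compilation:    state -1 selects cell 1
:noname ( -- n ) 300 ;   \ postponing:     state -2 selects cell 0
rectype: rectype-demo

 0 rectype-demo .   \ prints 100
-1 rectype-demo .   \ prints 200
-2 rectype-demo .   \ prints 300
```

Any state value outside -2..0 would index outside the table; a real implementation would probably want a range check.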
and then define generic rectypes just like in Matthias Trute's version with rectype:
JennyBrien
: rectype-lit: ( xt -- ) ['] noop swap dup >r :noname r@ compile, r> postpone literal postpone compile, postpone ; rectype: ;
not so straightforward, but possible.
ruv
Previous works
In general, I like the approach of active "rectype", i.e. when you can execute it to translate a token, so a "rectype" is a token translator: ( i*x token -- j*x ).
I described this approach in comp.lang.forth in 2018 (news:pngvcc$pta$1@gioia.aioe.org).
Bernd should also remember the comparison of version D with the Resolvers API, where I specified this approach and even provided several POCs.
and then define generic rectypes just like in Matthias Trute's version with rectype:
I also showed, just for illustration, a hybrid variant, in which a "rectype" can be executed and can also be an argument of the accessors (it is also compatible with version D, i.e. it is a "passive rectype" as JennyBrien mentioned above).
But the accessors from version D exclude some implementation approaches. Actually these accessors are useless when the higher-level methods are provided. Getting an xt and then executing this xt adds an extra step without any benefit in most cases. Let's provide the corresponding methods instead of the accessors.
This works with this method, but not with the previous way.
I'm not sure what you refer to, but "automatic postpone for literals" can be implemented in version D too.
: create-rectype-for-literal ( xt-compiler "name" -- )
['] noop swap dup rectype:
;
Token translator
Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
RECTYPE-SOMETYPE ( i*x state -- j*x )
By convention, the name of such a word should start with an English verb.
Concerning passing the state: in my Resolvers API, the state is passed indirectly, i.e. not via the stack. This makes combining translators easier.
E.g.:
: tt-3lit ( 3*x -- 3*x | ) >r tt-2lit r> tt-lit ;
VS
: tt-3lit-s ( 3*x state -- 3*x | ) dup >r swap >r tt-2lit-s r> r> tt-lit-s ;
Passing the state is cumbersome. Also, take into account that it's usually already kept in a variable anyway. Why do you need to pass it via the stack again and again? What is the rationale for passing it directly?
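A runnable sketch of the indirect variant, assuming the translators read STATE themselves and that only interpretation and compilation are handled (postponing is left out). The definitions of `tt-lit` and `tt-2lit` are hypothetical, since the comment shows only `tt-3lit`.

```forth
\ Hypothetical single-cell literal translator: it reads STATE itself.
: tt-lit ( x -- x | )
  state @ if postpone literal then ;

\ Combinations need no extra stack shuffling for the state.
: tt-2lit ( x1 x2 -- x1 x2 | )       >r tt-lit  r> tt-lit ;
: tt-3lit ( x1 x2 x3 -- x1 x2 x3 | ) >r tt-2lit r> tt-lit ;

1 2 3 tt-3lit . . .                  \ interpreting: prints 3 2 1

\ While compiling, the same translator compiles the three literals.
: three ( -- ) 1 2 3 tt-3lit ; immediate
: t ( -- x1 x2 x3 ) three ;
t . . .                              \ prints 3 2 1
```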
Terminology
Please stop using confusing terminology such as "data type id" (in "The core principle is still that the recognizer is not aware of state, and the returned data type id is"). This terminology is not compatible with the language of the standard. I suggested the proper terminology before and have now published the proposal on forth-standard.org; let's use it (and improve it where needed), or let's accurately define another terminology. The fact is that all the proposals about recognizers can share the same terminology.
Another example is the term "recognizer types". If a recognizer is a Forth definition having particular behavior, then "recognizer type" is "type of a recognizer", that is a type of a Forth definition, something like a function type. But actually you mean a "token descriptor", that is a "descriptor of a token", which tells something about the corresponding token and tells nothing about the recognizers (as Forth definitions).
ruv
Advantages
A huge advantage of this approach (but when the state is passed indirectly) is that most user-defined token translators can be created far more easily than the corresponding descriptors ("rectypes"). You don't need to cope with three actions, and you don't need to cope with the state at all, since any token translator can be created via other already defined translators!
BerndPaysan
Yes, I proposed that kind of solution years ago. In effect, both ways have the same expressive power, but one does it by creation of noname words, the other by normal code. Acceptance may differ.
ruv
@JennyBrien wrote
Compare: [...] with:
: rectype: create , , , ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ; rectype: rectype-nt
(sic: the full postpone action).
This comparison is incorrect, since in the proposed API rectype: (which generates a token translator) can be defined as follows:
: rectype: ( xt-executer xt-compiler xt-postponer "name" -- )
>r >r >r : ]] case
0 of [[ r> xt, ]] endof
-1 of [[ r> xt, ]] endof
-2 of [[ r> xt, ]] endof
-22 throw
endcase [[ postpone ;
;
And you can use your same code to define your rectype-nt or anything else.
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to define a very minimalistic API for a core recognizer and leaves fancier features to extensions. The problem it addresses is the same as in the original recognizer proposal, so this is not a full proposal; it only sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] rectype-null THEN ;
then you should factor the part starting with state @ out and return it as translator:
: word-translator ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- rectype )
here place here find dup IF ['] word-translator
ELSE drop ['] notfound THEN ;
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translator | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
SOME-TRANSLATOR ( i*x -- j*x )
A translator depends on STATE to translate the given arguments:
- 0 for interpretation
- -1 for compilation
- -2 for POSTPONE
i*x is the additional information provided by the recognizer.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZER ( addr len -- i*x translator | NOTFOUND-xt ) RECOGNIZER
This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW if the exception wordset is available.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:
Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: nt-translator ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: num-translator ( n -- )
case state @
-1 of lit, endof
-2 of lit, postpone lit, endof
endcase ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] nt-translator ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] num-translator ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognizer
The different actions during interpret/compile/postpone can be factored out easily, and used by a common dispatcher:
: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
Testing
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to define a very minimalistic API for a core recognizer and leaves fancier features to extensions. The problem it addresses is the same as in the original recognizer proposal, so this is not a full proposal; it only sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: word-translator ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- ... translator )
here place here find dup IF ['] word-translator
ELSE drop ['] notfound THEN ;
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translator | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
SOME-TRANSLATOR ( i*x -- j*x )
A translator depends on STATE to translate the given arguments:
- 0 for interpretation
- -1 for compilation
- -2 for POSTPONE
i*x is the additional information provided by the recognizer.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZER ( addr len -- i*x translator | NOTFOUND-xt ) RECOGNIZER
This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW if the exception wordset is available.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:
Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: nt-translator ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: num-translator ( n -- )
case state @
-1 of lit, endof
-2 of lit, postpone lit, endof
endcase ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] nt-translator ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] num-translator ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognizer
The different actions during interpret/compile/postpone can be factored out easily, and used by a common dispatcher:
: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
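A small usage sketch of this dispatcher for single-cell numbers. It assumes that STATE holds exactly -1 while compiling (the standard only guarantees a non-zero value) and that -2 is used for postponing as described in this proposal. The names `translate-demo-num`, `five` and `t` are hypothetical.

```forth
: translator: ( xt-interpret xt-compile xt-postpone "name" -- )
  create , , ,
  does> ( i*x addr -- j*x ) state @ 2 + cells + @ execute ;

: lit, ( n -- ) postpone literal ;

:noname ( n -- n ) ;                   \ interpretation: keep n on the stack
' lit,                                 \ compilation: compile n as a literal
:noname ( n -- ) lit, postpone lit, ;  \ postponing: compile code that compiles n
translator: translate-demo-num

5 translate-demo-num .                 \ STATE is 0 here: prints 5

\ The compilation action runs while t is being compiled (assumes STATE = -1).
: five ( -- ) 5 translate-demo-num ; immediate
: t ( -- n ) five ;
t .                                    \ prints 5
```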
Testing
BerndPaysan
Downside of using STATE right in the dispatcher: POSTPONE becomes more difficult. Instead of
: postpone ( "name" -- ) parse-name forth-recognizer -2 swap execute ; immediate
it is more convoluted
: postpone ( "name" -- )
parse-name forth-recognizer
state @ >r -2 state ! catch r> state ! throw ; immediate
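The save/set/restore pattern from the second definition can be isolated in a helper, sketched here under the assumption that the system allows storing directly into STATE (a standard program may not do this, so the sketch is system-level code). The names `execute-in-state` and `show-state` are hypothetical.

```forth
\ Hypothetical helper: run xt with STATE temporarily set to new-state.
: execute-in-state ( i*x xt new-state -- j*x )
  state @ >r  state !     \ save the old state, install the new one
  catch                   \ run xt; an exception is caught so the state can be restored
  r> state !  throw ;     \ restore the state, then re-throw (0 THROW does nothing)

: show-state ( -- n ) state @ ;

' show-state -2 execute-in-state .   \ prints -2
show-state .                         \ prints 0: the previous state is back
```

With such a helper, the body of the second POSTPONE definition would reduce to `parse-name forth-recognizer -2 execute-in-state`.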
How to detect [[ at the end of a postpone sequence is also not so trivial.
ruv
Downside of using STATE right in the dispatcher: POSTPONE becomes more difficult.
It's OK. Actually, we distribute complexity among various parts. When we make one thing less complex, we make another thing more complex. But because different things occur with different frequency (in systems, libraries, programs), the total complexity can end up lower or higher.
This approach also makes some things more complex, but the total complexity decreases, I believe.
Concerning POSTPONE: I think some useful parts should be factored out.
Also, we don't need to catch the exception. Usually it's a stop error, and the state is ambiguous in any case. QUIT resets all the internal states. Concerning programs: we need a standard way to reset the internal states of the Forth text interpreter, regardless of the Recognizers proposal.
In my "lexeme resolvers" implementation I use the concept of a postponing level that can be 0, 1, 2, and introduce words to increment and decrement this level.
So, POSTPONE is defined as follows:
: postpone ( " name" -- ) parse-name inc-state translate-lexeme dec-state ( flag ) ?nf ; immediate
Where translate-lexeme is defined as follows:
: perceive-lexeme ( c-addr u -- k*x xt-tt | c-addr u 0 )
perceptor dup if execute then
;
: translate-lexeme ( i*x c-addr u -- j*x true | c-addr u 0 )
perceive-lexeme dup if execute true then
;
(Note that, in contrast to this proposal, resolvers return ( c-addr u 0 ) on failure.)
How to detect [[ at the end of a postpone sequence is also not so trivial.
An appropriate approach is to make the word ]] a parsing word.
: ]] ( -- )
inc-state begin
next-lexeme 2dup s" [[" equals 0= while
translate-lexeme ?nf
repeat 2drop dec-state
; immediate
So we don't have any problem detecting [[ at the end.
An advantage of the postponing level concept is that the following code works as expected:
: foo [ ]] 123 . [[ ] ; foo \ prints 123
In the message news:rdcur5$ga4$1@dont-email.me (the full message: news:rdcn35$sd2$1@dont-email.me) I showed another approach, when postponing action is not required at all (i.e., -2 state in this proposal).
ruv
translator: subtype of xt, and executes with the following stack effect:
SOME-TRANSLATOR ( i*x -- j*x )
It's correct in the general case, but it makes little sense, since any definition meets this stack effect.
So I think we should distinguish the parameters of a translator itself from the effect of translating the code that is passed to the translator. Possible variants:
\ We can define 'token' data type
TRANSLATE-SOMETOKEN ( i*x token -- j*x )
\ Some hybrid variant
TRANSLATE-SOMETOKEN ( i*x token{k*x} -- j*x )
\ Only low level data types
TRANSLATE-SOMETOKEN ( i*x k*x -- j*x )
(NB: I use the conventional naming {verb}-{noun} for such words.)
It should also be noted that these x may be distributed across all the stacks: the data stack, the floating-point stack, the control-flow stack (except token k*x, which cannot be on the control-flow stack).
BerndPaysan
Indeed, TRANSLATE-SOMETHING sounds better than SOMETHING-TRANSLATOR.
FORTH-RECOGNIZER is ok, because it's followed by EXECUTE, so this is a noun.
ruv
"FORTH-RECOGNIZER" name
I thought about the FORTH-RECOGNIZER name.
It makes a strong impression that this word is similar to FORTH-WORDLIST ( -- wid ). The problem is that it isn't.
FORTH-WORDLIST is a constant (it always returns the same value) that identifies one and the same word list among all the word lists. This word list can be included in the search order, and it can be absent from the search order.
By analogy, FORTH-RECOGNIZER should be a constant that identifies one and the same recognizer among all the recognizers. This recognizer can be included in the recognizer that is used by the Forth text interpreter, and it can be absent from it. (In accordance with the concept that a sequence of recognizers is also a recognizer.)
All of this would have to hold for the naming to be consistent. But actually it does not. It means that this name breaks consistency and is inappropriate for the proposed word.
FORTH-RECOGNIZER ( -- xt ) can be a word that returns the xt of the system's recognizer that is used by the Forth text interpreter by default (i.e. initially).
FORTH-RECOGNIZER is ok, because it's followed by EXECUTE, so this is a noun.
Also, it makes a strong impression that it returns a recognizer. But that's wrong. Also, its result is analyzed much more often than it's followed by EXECUTE.
Basic methods
By no means, we need
- a method that tells the Forth text interpreter to use a given recognizer.
- a method that returns the recognizer that is currently used by the Forth text interpreter,
- a method that performs the recognizer that is currently used by the Forth text interpreter
A single deferred word (a vector) X can solve it:
- set: IS X
- get: ACTION-OF X
- perform: X
But I insist that this approach limits implementations too much. A Forth system may want to perform internal actions when the recognizer used by the Forth text interpreter is switched. But it cannot do that if this recognizer is switched via the IS X method. For that reason, distinct getter and setter words are usually provided in the Standard (except for the very ancient BASE and >IN, due to backward compatibility).
Yes, perhaps Gforth can attach additional internal actions to the IS X phrase. But we shouldn't complicate all Forth system implementations.
A possible implementation via deferred word and distinct getter and setter words:
defer perceive ( c-addr u -- k*x tt )
: perceptor ( -- xt ) action-of perceive ;
: set-perceptor ( xt -- ) is perceive ;
Perhaps, the more specific names are better (?):
defer perceive-lexeme ( c-addr u -- k*x tt )
: lexeme-perceptor ( -- xt ) action-of perceive-lexeme ;
: set-lexeme-perceptor ( xt -- ) is perceive-lexeme ;
ruv
Correction: please read "By anyway, we need" instead of "By no means, we need".
BerndPaysan
DEFER is a core word now, so using DEFER for such a thing is ok. We don't need a special getter and setter for everything.
The implication that FORTH-RECOGNIZER returns a recognizer (and does not, it executes one) is a valid point. A better name is needed. At the moment it is a VALUE and does return a recognizer. Now, it is a deferred word, and does recognize strings. We should keep it with Anton's unification: a sequence of recognizers can be combined to one recognizer. Just because it's now recognizing more different things, it's still a recognizer. No need to find another synonym. Takes string, returns data+translator token: that is a recognizer.
Maybe RECOGNIZE-FORTH is the corresponding verb. It takes a string and recognizes it if this is valid FORTH.
ruv
DEFER is a core word now, so using DEFER for such a thing is ok.
Actually, DEFER, as well as TO, is a Core extension word, so it's optional. But that's another argument.
Back to my first argument, what do you suggest if a system needs to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter?
You can ask, do I have an example of such requirement. Yes, I do. I want to provide a method to undo such switching in my system. It's similar to effect of the "PREVIOUS" word for the search order. Perhaps you can suggest some solution with the deferred word?
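A minimal sketch of that requirement, assuming the perceive / perceptor / set-perceptor names used earlier in this comment: a dedicated setter gives the system a place for an internal action on every switch, here a single-level undo. `previous-perceptor`, `undo-perceptor` and the toy recognizers `rec-a` and `rec-b` are hypothetical and only illustrate the idea.

```forth
defer perceive ( c-addr u -- k*x tt )
variable previous-perceptor          \ hypothetical: remembers the replaced recognizer

: perceptor ( -- xt ) action-of perceive ;
: set-perceptor ( xt -- )
  perceptor previous-perceptor !     \ internal action performed on every switch
  is perceive ;
: undo-perceptor ( -- ) previous-perceptor @ is perceive ;

\ Toy recognizers that return a marker instead of a real translator.
: rec-a ( c-addr u -- 1 ) 2drop 1 ;
: rec-b ( c-addr u -- 2 ) 2drop 2 ;

' rec-a is perceive                  \ initial recognizer, installed by the system
' rec-b set-perceptor
s" x" perceive .                     \ prints 2
undo-perceptor
s" x" perceive .                     \ prints 1
```

A plain `IS perceive` would bypass the bookkeeping in `set-perceptor`, which is the limitation being discussed.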
Anton's unification: a sequence of recognizers can be combined to one recognizer.
Yes. I too said that any sequence of recognizers seq-x (from API v4) can be represented as a single recognizer : recognize-x seq-x recognize ; . So sequences are excessive in the basic API: a Forth system doesn't need to know whether it is a sequence or not.
Maybe RECOGNIZE-FORTH is the corresponding verb. It takes a string and recognizes it if this is valid FORTH.
It's better. But it recognizes not valid FORTH, but anything that the Forth text interpreter can currently recognize (and only that).
Conceptually, this word isn't just a recognizer. There is a single special system's slot for a recognizer that is used by the Forth text interpreter. We can put any recognizer into this slot. We can also perform the recognizer that is placed into this slot. So this word performs the recognizer from this slot. I am inclined to call this slot "perceptor". And after that the word that performs the recognizer from this slot becomes "perceive".
All recognizer names have the pattern RECOGNIZE-*. The idea is not to put this special word on a par with all other recognizers. For that, it's better to find a name that is distinct from the RECOGNIZE-SOMETHING pattern. What do you think?
ruv
Actually, DEFER, as well as TO, is a Core extension word, so it's optional. But it's another argument.
This argument is that a Forth system can be implemented as a minimal kernel and additional libraries. And DEFER, IS, ACTION-OF can be available via a library. But when we put a deferred word into this API, we force a system's author to put DEFER, IS, ACTION-OF into the kernel too. But actually they aren't required in the kernel. It would be too restrictive a limitation on implementations.
ruv
Locate
locate cannot work for lexemes that can be recognized (translated) according to this proposal.
ruv
The last comment was intended for the proposal of AndrewHaley, and it was mistakenly placed here.
BerndPaysan
The recognizer will be an option, as well. At the moment, FORTH-RECOGNIZER is proposed to be a value. That's also a CORE EXT word (as is TO).
A minimalistic system that wants to implement recognizers needs FORTH-RECOGNIZER to be a deferred word. I.e. it needs code for DODEFER. It can load the rest of the deferred word stuff later as extension.
ruv
Certainly, recognizers are an option. I didn't mean that some required part requires an optional part. I mean that one optional part requires another complex optional part without any good and fair ground.
Yes, a minimalistic system that wants to provide a deferred word needs only code for DODEFER. But it still makes bootstrapping of this system more complex. Hence, when we put a deferred word into the API, we make things more complex for some implementations. But we don't even have a rationale for that.
Also, with deferred word we still don't have a solution if a system needs to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter.
BerndPaysan
CORE has only VARIABLE as option for storing things to change. As a result, the interface to use FORTH-RECOGNIZER has to be clumsy, i.e.
forth-recognizer @ execute execute
Clumsy interfaces can not be changed if you have better things at hand. You can probably wrap around the clumsy interface, e.g.
Defer recognize-forth
addr recognize-forth Constant forth-recognizer
if you can use ADDR to access the deferred word's xt storage location. But then you have another interface, less clumsy, and only available when you have DEFER+ADDR (and ADDR is not even part of the standard).
A minimalistic API, which is what I am looking for here, is one where you don't have to document much. The less uniform an API is, the more you have to document. The uniformity here is that a recognizer is a word that has ( addr u -- i*x translator-xt ) as stack effect. And combinations of recognizers have the same effect. And the system's recognizer is just another one, which you can swap in and out. And you can define a REC-SEQUENCE, where you can manipulate the sequence, and put that into the system's recognizer.
This uniformity is broken when you don't use a deferred word for the system's recognizer: you can't just call that one as you can call the others. You need @ EXECUTE. This is clumsy.
ruv
CORE has only VARIABLE as option for storing things to change. As a result, the interface to use FORTH-RECOGNIZER has to be clumsy, i.e. forth-recognizer @ execute execute
I don't suggest using a variable in the interface; it's even worse than a defer. When a variable is used to change something, the change cannot be effectively detected. But the requirement is: an ability for a system to perform internal actions on switching the recognizer that is currently used by the Forth text interpreter.
For that I would prefer to have the separate words in the API: a setter, a getter and a "performer" (a word that performs the recognizer that is currently used by the Forth text interpreter).
What are your objections to having several separate words in the minimalistic API?
The uniformity here is that a recognizer is a word that has ( addr u -- i*x translator-xt ) as stack effect.
I strongly support this approach (and I myself suggested this approach too, with slightly different stack effects).
This uniformity is broken when you don't use a deferred word for the system's recognizer
It seems, the set of words like the following (the names may vary):
perceive ( c-addr u -- k*x tt )
set-perceptor ( xt -- )
perceptor ( -- xt )
doesn't break the mentioned uniformity. Please, clarify.
BerndPaysan
Using special setters and getters means you have another (special purpose) DEFER mechanism here. Of course you can implement that with
variable current-perceptor
: perceive ( addr u -- i*j token ) current-perceptor @ execute ;
: set-perceptor ( xt -- ) current-perceptor ! ;
: perceptor ( -- xt ) current-perceptor @ ;
which is probably a bit less implementation effort than DEFER, IS, and ACTION-OF. Or really?
State-Smart:
: defer Create ['] noop , does> @ execute ;
: is ' >body state @ if ]] literal ! [[ else ! then ; immediate
: action-of ' >body state @ if ]] literal @ [[ else @ then ; immediate
or with NDCS:
: defer Create ['] noop , does> @ execute ;
: is ' >body ! ; ndcs: ' >body ]] literal ! [[ ;
: action-of ' >body @ ; ndcs: ' >body ]] literal @ [[ ;
DEFER is really a lightweight way to define words that can be changed.
These three lines of code are doing more than the three lines of code you need in addition when you have your special-purpose setter and getter, but they are still one-liners.
Forthers like to reinvent the wheel. But don't overdo this.
ruv
Using special setters and getters means you have another (special purpose) DEFER mechanism here.
Not necessarily. It's up to an author/implementer. It can be just wrappers over standard DEFER, as I showed earlier. So it doesn't mean reinventing the wheel. The implementation details are just hidden.
So the arguments concerning implementation of DEFER mechanism say nothing against three separate words in the minimalistic API.
BTW, having translators for the basic data types, the words is and action-of can be even shorter:
: is ' >body tt-lit ['] ! tt-xt ; immediate
: action-of ' >body tt-lit ['] @ tt-xt ; immediate
Well, in any case I would agree that the arguments concerning complexity are more or less weak.
A strong argument (that hasn't been commented on yet) is about additional actions that a system needs to perform in the setter. What do you think in this regard?
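To make the point about setter-side actions concrete, here is a minimal sketch, assuming the three-word interface named earlier in this thread (perceive, set-perceptor, perceptor). The variables current-perceptor and switch-count and the demo words tt-none and rec-none are hypothetical; the counter merely stands in for whatever internal bookkeeping a system might need when the recognizer is switched.

```forth
variable current-perceptor               \ xt of the active recognizer
variable switch-count  0 switch-count !  \ hypothetical bookkeeping

: perceptor ( -- xt ) current-perceptor @ ;
: set-perceptor ( xt -- )
  current-perceptor !
  1 switch-count +! ;                    \ extra action performed on every switch
: perceive ( c-addr u -- k*x tt ) perceptor execute ;

\ demo recognizer (assumption): accepts anything, returns a do-nothing translator
: tt-none ( -- ) ;
: rec-none ( c-addr u -- tt ) 2drop ['] tt-none ;

' rec-none set-perceptor
```

Because the switch goes through an ordinary colon definition, a system is free to add cache invalidation or similar work there without the caller noticing.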
ruv
One more strong argument against a DEFER word in the API, and for distinct getter and setter words, is the following.
With DEFER in the API, we cannot define this API over another API at all. But with a distinct getter and setter (and "executer"), it is possible to define this API over some other APIs.
Example: news:rn1csa$b02$1@dont-email.me
BerndPaysan
Gforth's new header structure allows overloading TO, IS (which are essentially the same) and DEFER@, so we can use the DEFER API to access similar changeable execution patterns implemented differently. So for us, it makes sense to use these access words, regardless of how they are implemented.
Other systems may not have this capability, though the way the standard now extends TO for FVALUE and others, you need to have one way or the other to deal with that. The same holds when you have a UDEFER in your system for user-specific deferred words.
For me, it is needless clutter of the dictionary and the mental space of the programmer to add setters and getters for things where you already have a generic one. But I see the point that not every system can do this.
ruv
needless clutter of the dictionary and the mental space of the programmer
I have used an approach where a defining word creates two words: a getter and a setter.
It's something like this: after the phrase create-prop x the words x and set-x are created. I didn't notice any mental space clutter in this regard. Sometimes I redefined set-x to add additional checks or actions.
Concerning dictionary space — I don't see any problem.
But I see the point that not every system can do this.
True. And even if a system can do this, it's done in some system specific way only.
So, due to the combination of all reasons, it's better to have distinct ordinary words in the standard API.
StephenPelc
If people are interested, I can arrange a virtual meeting for recognisers. They have been workshopped at various Forth Standards meetings but little of substance has emerged so far. I would suggest that such a meeting concentrate on finding what we can agree on.
Note that Forth-200x meetings are public, and the use of real names is strongly encouraged.
ruv
If people are interested, I can arrange a virtual meeting for recognisers. ... concentrate on finding what we can agree on.
I like this idea.
If people are interested, I will prepare before the meeting a proof of concept — an implementation of Recognizer API v4, Nestable Recognizer Sequences, or some other over this API.
Perhaps, somebody could share his list of questions before the meeting. My list at GitHub.
StefanK
A small remark on the POSTPONE test.
We can factor postpone into two parts with state-execute, similar to base-execute:
: state-execute ( xt s -- ) state @ >r state ! catch r> state ! throw ;
: POSTPONE ( "name" -- ) parse-name forth-recognizer -2 state-execute ; immediate
That's not very difficult anymore.
StefanK
IMHO the idea to use a deferred forth-recognize is good and more flexible than a stack of recognizers. But the translator xt makes postpone more difficult. But we can factor postpone into two parts. One that restores the stack contents at runtime, similar to what lit, does, and one that does the compilation. If we use rectype, similar to the proposal of recognizers from 2018, but with lit, as third method, we get an easy postpone and '. Here, we can reuse the compile method directly by the second factor of postpone.
variable state
: translate>interpret @ ;
: translate>compile cell+ @ ;
: translate>lit, cell+ cell+ @ ;
\ Well, its translate>*lit, in fact; i.e. regenerate ( i*x ) at runtime.
Defer forth-recognizer ( addr u -- i*x translator / notfound )
Defer perform ( i*x translator -- j*x )
: perform>interpret translate>interpret execute ;
: perform>compile translate>compile execute ;
: on -1 swap ! ;
: off 0 swap ! ;
: [ ['] perform>interpret is perform state off ; IMMEDIATE
: ] ['] perform>compile is perform state on ;
\ alternatively:
\ :noname is perform state ; dup
\ : [ ['] perform>interpret [ compile, ] off ; IMMEDIATE
\ : ] ['] perform>compile [ compile, ] on ;
\ another alternative:
\ : perform state @ IF translate>compile ELSE translate>interpret THEN execute ;
' [ execute \ initialize state and perform
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer perform
REPEAT ;
: lit, ( n -- ) lit lit , , ; \ or postpone literal
: throw-13 -13 throw ;
: translator ( xt-*lit, xt-compile xt-interpret "name" -- )
create , , , ;
' throw-13 dup dup translator notfound
' lit,
:noname ( nt -- xt-execute | xt-compile, ) dup >cfa swap immediate? IF execute ELSE compile, THEN ;
:noname ( i*x nt -- j*x ) >cfa execute ;
translator translate-nt
' lit,
' lit,
:noname ; \ noop
translator translate-const-cell
: rec-nt ( addr u -- nt translate-nt | notfound )
forth-wordlist find-name-in dup IF translate-nt ELSE drop notfound THEN ;
: rec-num ( addr u -- n translate-const-cell | notfound )
0. 2swap >number 0= IF 2drop translate-const-cell ELSE 2drop drop notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator | n num-translator | notfound )
2>r 2r@ rec-nt dup notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognizer
\ simple postpone
: postpone ( "name" -- )
parse'n'recognize
dup translate>compile >r translate>lit, execute r> compile,
; IMMEDIATE
\ postpone optimized for immediate words
: postpone ( "name" -- )
parse'n'recognize \ ( i*x translator )
dup translate-nt = IF ( nt translator )
over immediate? IF drop >cfa compile, exit THEN
THEN
dup translate>compile >r translate>lit, execute r> compile,
; IMMEDIATE
: ' ( "name" -- xt ) parse'n'recognize translate-nt <> IF throw-13 THEN >cfa ; IMMEDIATE
ruv
@StefanK, thank you for your participation. But it looks like you have missed too many arguments discussed above.
For example, a deferred word forth-recognizer has a confusing name, and it cannot be acceptable in the API, since it's difficult for the Forth system to detect when its value is changed (NB: it isn't an argument in favor of "stack of recognizers").
: translator ( xt-*lit, xt-compile xt-interpret "name" -- ) create , , , ;
' lit, :noname ( nt -- xt-execute | xt-compile, ) dup >cfa swap immediate? IF execute ELSE compile, THEN ; :noname ( i*x nt -- j*x ) >cfa execute ; translator translate-nt
Also, could you please stick to a consistent and clear terminology?
In this example you create not a token translator, but a named token descriptor (and the corresponding token descriptor object). See Common terminology for recognizers (improvements and criticism are welcome).
token descriptor object: an implementation dependent data object that describes how to interpret, how to compile and how to postpone (if any) a token.
I also proposed the following naming convention for the corresponding words:
- For token translators use names in the form tt-*, which abbreviates translate-token-*; for example, tt-lit, tt-nt.
- For token descriptors use names in the form td-*, which abbreviates token-descriptor-*; for example, td-lit, td-nt.
The approach employed in your example to create a token descriptor can be called the "three components" approach. A significant disadvantage of this approach is that it doesn't provide a way to reuse old descriptors when you create a new descriptor. Compare this to token translators: they can easily be reused to create new token translators. For example, a token translator for a pair ( nt nt ) can be created using the token translator tt-nt for a single nt as:
: tt-2nt ( i*x nt nt -- j*x ) >r tt-nt r> tt-nt ;
To create a token descriptor td-2nt in the three components approach, you need to put in a lot more effort, and you cannot reuse the td-nt descriptor.
One possible solution is not to expose the three components approach in the API and instead provide a special method to create a descriptor from other descriptors.
For td-2nt it can look like:
tt-nt dup 2 descriptor constant tt-2nt
\ or
td{ tt-nt tt-nt }td constant tt-2nt
It seems, a user never needs to provide three components for a new descriptor since any new descriptor is always based on some already defined descriptors.
But the approach based on the token translators is far simpler.
By the way, a well-known word to get an xt from an nt is name> ( nt -- xt ) (see Forth-83 / "C. Experimental proposal" / "Definition field address conversion operators").
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows more fancy stuff to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: word-translator ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- ... translator )
here place here find dup IF ['] word-translator
ELSE drop ['] notfound THEN ;
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Recognized
recognized: subtype of xt, and executes with the following stack effect:
RECOGNIZED-THING ( j*x i*x state -- k*x )
A recognized xt acts on the state passed to it on the stack
- 0 for interpretation
- -1 for compilation
- -2 for POSTPONE
i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x recognized-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the recognized xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
GET-FORTH-RECOGNIZE ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE.
REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
RECOGNIZED: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a recognized word under the name "name", which performs xt-int for state=0, xt-comp for state=-1 and xt-post for state=-2.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:
Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: nt-translator ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: num-translator ( n -- )
case state @
-1 of lit, endof
-2 of lit, postpone lit, endof
endcase ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] nt-translator ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] num-translator ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognizer
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: get-forth-recognize ( -- xt )
action-of forth-recognize ;
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
Stacks TBD.
Testing
TBD
ruv
A recognized xt acts on the state passed to it on the stack
A proper term for "recognized xt" ("recognized execution token") should be chosen. "recognized xt" means "xt that is recognized", but we don't recognize execution tokens, but recognize lexemes. This xt just is a result of recognizing a lexeme. And it should be named according what it does, not according who produces it.
There is no reason to pass state on the stack: we discussed that, and the reference implementation reflects that.
BerndPaysan
The STATE discussion in the 2021 workshop concluded that words or xts executed should not depend on STATE. The reference implementation needs to be adjusted.
For the name of the result values we might want to have another round of bikeshedding. In particular with more native speakers. The current wording represents the last round of bikeshedding.
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows more fancy stuff to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: word-translator ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- ... translator )
here place here find dup IF ['] word-translator
ELSE drop ['] notfound THEN ;
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Recognized
recognized: subtype of xt, and executes with the following stack effect:
RECOGNIZED-THING ( j*x i*x state -- k*x )
A recognized xt acts on the state passed to it on the stack
- 0 for interpretation
- -1 for compilation
- -2 for POSTPONE
i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x recognized-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the recognized xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
GET-FORTH-RECOGNIZE ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE.
REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
RECOGNIZED: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a recognized word under the name "name", which performs xt-int for state=0, xt-comp for state=-1 and xt-post for state=-2.
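The REC-SEQUENCE: entry above says the sequence is tried from xtn towards xt1 until one recognizer succeeds. A minimal sketch of one way to get that behavior follows; it is not taken from the proposal. The names toy-rec-sequence:, toy-notfound, tt-keep, rec-yes and rec-no are hypothetical, and only the defining word is modeled (no SET-/GET-REC-SEQUENCE).

```forth
: toy-notfound ( -- ) -13 throw ;   \ stand-in for NOTFOUND

\ Body layout: n, then xtn .. xt1 (the topmost xt is stored first, so it is tried first).
: toy-rec-sequence: ( xt1 .. xtn n "name" -- )
  create dup , 0 ?DO , LOOP
  does> ( c-addr u body -- i*x xt-translator | xt-notfound )
    dup cell+ swap @ cells over + swap ?DO   ( c-addr u )
      I @ -rot 2>r                           ( xt-rec ) ( R: c-addr u )
      2r@ rot execute                        ( i*x xt )
      dup ['] toy-notfound <> IF
        2r> 2drop UNLOOP EXIT                \ success: discard the saved string
      THEN
      drop 2r>                               \ failure: restore string, try next
    1 cells +LOOP
    2drop ['] toy-notfound ;

\ demo recognizers (assumptions): each knows exactly one lexeme
: tt-keep ( x -- x ) ;
: rec-yes ( c-addr u -- true xt | xt-notfound )
  s" yes" compare 0= IF true ['] tt-keep ELSE ['] toy-notfound THEN ;
: rec-no ( c-addr u -- false xt | xt-notfound )
  s" no" compare 0= IF false ['] tt-keep ELSE ['] toy-notfound THEN ;

' rec-no ' rec-yes 2 toy-rec-sequence: rec-bool   \ rec-yes is tried first
```

Since rec-bool has the same stack effect as its members, it can itself be placed into another sequence or plugged in as the system recognizer, which is the uniformity argued for in this thread.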
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:
Defer forth-recognizer ( addr u -- i*x translator / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: recognized-nt ( nt state -- )
case
0 of name>interpret execute endof
-1 of name>compile execute endof
-2 of name>compile swap lit, compile, endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: recognized-num ( n state -- )
case
-1 of lit, endof
-2 of lit, postpone lit, endof
endcase ;
: rec-nt ( addr u -- nt recognized-nt / notfound )
forth-wordlist find-name-in dup IF ['] recognized-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n recognized-num / notfound )
0. 2swap >number 0= IF 2drop ['] recognized-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt recognized-nt / n recognized-num / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognizer
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: get-forth-recognize ( -- xt )
action-of forth-recognize ;
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
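The table lookup in recognized: relies on the order in which the three xts are stored. The sketch below replays the same arithmetic with a hypothetical variable mode standing in for STATE (a standard program cannot portably store -2 into STATE); dispatcher:, demo and the three act-* words are illustrative names only.

```forth
variable mode   \ hypothetical stand-in for STATE: 0, -1 or -2

: dispatcher: ( xt-int xt-comp xt-post "name" -- )
  create , , ,            \ cell 0 = xt-post, cell 1 = xt-comp, cell 2 = xt-int
  does> ( i*x body -- j*x )
    mode @ 2 + cells + @ execute ;   \ mode + 2 selects cell 2, 1 or 0

: act-int  ( -- 1 ) 1 ;
: act-comp ( -- 2 ) 2 ;
: act-post ( -- 3 ) 3 ;

' act-int ' act-comp ' act-post dispatcher: demo

0 mode !  demo .    \ prints 1
-1 mode ! demo .    \ prints 2
-2 mode ! demo .    \ prints 3
```

Any other mode value indexes outside the three-cell table, so a production version would presumably want a range check.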
Stacks TBD.
Testing
TBD
ruv
The STATE discussion in the 2021 workshop concluded that words or xt executed should not depend on STATE.
I see the following in the report by @ulli on 2021-09-08:
Given the two variations to handle STATE (either in RECOGNIZER:'s DOES> part or in INTERPRET), yesterdays participants favoured to have the single occurrence of STATE in INTERPRET. Further investigation and model implementations will show whether on or the other is beneficial.
So it implies further investigation and model implementations.
Could someone provide a rationale in favor of passing state (better say "mode") via the stack?
My rationale against mode on the stack is the following:
- It makes combination of token translators cumbersome. E.g. a definition
: tt-3lit ( 3*x -- 3*x | ) >r tt-2lit r> tt-lit ;
becomes far more complex.
- In most cases a program doesn't need to execute a token translator in a mode that is different from the current mode (counter examples are welcome, except postpone).
- The current mode is already held by the system anyway.
- (most importantly) It introduces unnecessary coupling between the Forth text interpreter loop and the Recognizer API. This loop does not need to know anything about modes and STATE at all. If we are replacing the system's lexeme translator (along with the system's set of token translators), we should be able to replace it along with the system's STATE (and the set of the system's modes) too. Moreover, a token translator can technically ignore the passed value and use its own set of modes. And even such a simpler mode-beyond-stack API can be implemented over one that passes mode via the stack.
On the other hand I don't think that including (mentioning) STATE in a new API is a good choice. STATE returns a read-only address, and it's provided for backward compatibility only. So a better method instead of STATE is required anyway.
Actually, the system's token translators are the only ones that depend on the system's set of modes. In most cases user-defined token translators are defined via the system's token translators (which should be standardized) and they need to know nothing about the system's set of modes, or about STATE at all. At the same time, a user is able to define their own set of recognizers and token translators that don't depend on the system's set of modes, but introduce their own set of modes.
So, the specification for the Recognizer API should mention neither STATE nor a set of magic values like {0, -1, -2}.
Concerning your mode -2: I believe the standard word postpone doesn't need its own mode. But in postponing mode, if any, string literals (like s" foo bar") and comments should be properly treated.
ruv
- It introduces unnecessary coupling between the Forth text interpreter loop and the Recognizer API.
See a block-based illustration of this idea in my Gist.
ruv
: recognized: ( xt-interpret xt-compile xt-postpone "name" -- ) create , , , does> state @ 2 + cells + @ execute ;
In the reference implementation you keep the mode {0,-1,-2} in the state variable.
But it's problematic, regardless of how the mode is passed into token translators (directly via the stack, or indirectly using a dedicated method).
When interpretation state is set by [ (so state is set to 0), and then compilation state is set by ], the mode should be the same as before [. If it was -2, it should be set to -2 again. But the information that the mode was -2 is lost. So another variable should be used to keep a flag whether "POSTPONE" mode is active or not.
Actually, the mode of compilation/interpretation and "POSTPONE" mode are not mutually exclusive. They can be set independently of each other.
For example, the code:
: foo postpone bar [ postpone baz ] ;
can conceptually be defined quite clearly (see my comment). In this fragment, for the lexeme bar "POSTPONE" mode is active in compilation state; for baz "POSTPONE" mode is active in interpretation state.
So, if "POSTPONE" mode is employed, a different variable for it should be used for this reason too.
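The loss of the -2 mode across [ ... ] can be replayed in a few lines. This is only an illustrative model, not system code: mode, compiling, postponing and the my[ / my] style words are hypothetical stand-ins for STATE, [ and ].

```forth
\ Single-variable model: 0 = interpret, -1 = compile, -2 = postpone
variable mode
: my[ ( -- ) 0 mode ! ;
: my] ( -- ) -1 mode ! ;

-2 mode !  my[ my]
mode @ .            \ prints -1: the earlier -2 is gone

\ Two-variable model: the postpone flag is independent of [ and ]
variable compiling   variable postponing
: my2[ ( -- ) false compiling ! ;
: my2] ( -- ) true compiling ! ;

true compiling !  true postponing !
my2[ my2]
postponing @ .      \ prints -1 (true): still postponing
```

With two flags, all four combinations of interpreting/compiling and postponing/not postponing are representable, which matches the observation that the two modes are independent.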
On the other hand I'm not convinced that we need "POSTPONE" mode at all.
Except to implement the word postpone itself, where and how can this mode be used? Even for the questionable construct ]] ... [[ the mode "POSTPONE" isn't needed.
BerndPaysan
OK, the most convincing argument is that STATE can go away as a specified thing. You can use and combine system translators, and you can create table-driven translators, but STATE is an implementation detail.
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows more fancy stuff to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE 2drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
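To see the loop and the recognizer/translator split working together, the following is a minimal sketch that can be loaded on top of an existing standard system. It is not the reference implementation: all DEMO- names are hypothetical stand-ins for the proposal's words, and the sketch assumes that SEARCH-WORDLIST and PARSE-NAME are available and that BASE is decimal.

```forth
\ Hypothetical stand-ins for NOTFOUND and two translators.
: demo-notfound ( -- ) -13 throw ;
: demo-translate-xt ( i*x xt flag -- j*x )
  \ flag is 1 for immediate words, -1 otherwise (as from SEARCH-WORDLIST)
  0< state @ and IF compile, ELSE execute THEN ;
: demo-translate-num ( n -- n | )
  state @ IF postpone literal THEN ;

\ Recognizers: they never look at STATE, they only return a translator.
: demo-rec-word ( addr u -- xt flag translator-xt | notfound-xt )
  forth-wordlist search-wordlist dup
  IF ['] demo-translate-xt ELSE drop ['] demo-notfound THEN ;
: demo-rec-num ( addr u -- n translator-xt | notfound-xt )
  0. 2swap >number 0=
  IF 2drop ['] demo-translate-num
  ELSE 2drop drop ['] demo-notfound THEN ;
: demo-recognize ( addr u -- i*x translator-xt | notfound-xt )
  2>r 2r@ demo-rec-word
  dup ['] demo-notfound = IF drop 2r@ demo-rec-num THEN
  2r> 2drop ;

\ The interpreter loop: recognize each lexeme, then run its translator.
: demo-interpret ( i*x -- j*x )
  BEGIN parse-name dup WHILE demo-recognize execute REPEAT
  2drop ;
```

One way to try it is `s" demo-interpret 1 2 +" evaluate .`, where the system interpreter only handles the first word and DEMO-INTERPRET processes the rest of the string; on a typical system this should print 3.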
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
THING-TRANSLATOR ( j*x i*x -- k*x )
A translator xt performs or compiles the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling, or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused.
REC-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-REC-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-REC-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state, using a system-specific way to determine the current mode.
TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in interpretation state.
TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in compilation state.
TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in postpone state.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component you can use to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish them.
Defer forth-recognizer ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognizer execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: translate-num ( n -- )
case state @
-1 of lit, endof
endcase ;
: translate-dnum ( d -- )
\ example of a composite translator using existing translators
>r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognizer
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: get-forth-recognize ( -- xt )
action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- ) 2 cells + @ execute ;
: translate-comp ( translate-xt -- ) cell+ @ execute ;
: translate-post ( translate-xt -- ) @ execute ;
Stacks TBD, copy from Trute proposal.
Testing
TBD
BerndPaysanNew Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-nt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-word ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE 2drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a recognized xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x recognized | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
THING-TRANSLATOR ( j*x i*x -- k*x )
A translator xt performs or compiles the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling, or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. An ambiguous condition exists if the exception word set is not available.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state, using a system-specific way to determine the current mode.
TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in interpretation state.
TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in compilation state.
TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in postpone state.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component you can use to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: translate-num ( n -- )
case state @
-1 of lit, endof
endcase ;
: translate-dnum ( d -- )
\ example of a composite translator using existing translators
>r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: get-forth-recognize ( -- xt )
action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- ) >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- ) >body cell+ @ execute ;
: translate-post ( translate-xt -- ) >body @ execute ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
: recognizer-sequence: ( rec1 .. recn n "name" -- )
dup stack: dup 1+ cells negate here + set-stack
DOES> recognize ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) >body get-stack ;
Testing
TBD
BerndPaysanNew Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE 2drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt performs or compiles the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling, or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best-effort approach to display an adequate error message.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state, using a system-specific way to determine the current mode.
TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in interpretation state.
TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in compilation state.
TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in postpone state.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component you can use to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single numbers without base prefix. This implementation takes only interpret and compile state into account, and uses the STATE variable to distinguish them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate-nt ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: translate-num ( n -- )
case state @
-1 of lit, endof
endcase ;
: translate-dnum ( d -- )
\ example of a composite translator using existing translators
>r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognizer ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognizer is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: get-forth-recognize ( -- xt )
action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- ) >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- ) >body cell+ @ execute ;
: translate-post ( translate-xt -- ) >body @ execute ;
Defining translators
Once you have TRANSLATE: and the associated invocation tools, you shall define the translators using it:
: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
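The stack words above use CELL and BOUNDS, which are not standard words (Gforth provides them). As a minimal sketch, the same memory layout (a count cell followed by the items) can be written with standard words only; the DEMO- names are hypothetical and chosen so that they do not clash with existing definitions.

```forth
\ Count cell first, then room for "size" items.
: demo-stack: ( size "name" -- )
  create 0 , cells allot ;

\ Store n in the count cell, then item-1 .. item-n in ascending slots.
: demo-set-stack ( item-n .. item-1 n stack-id -- )
  2dup !  cell+ swap cells over + swap
  ?do i ! 1 cells +loop ;

\ Walk from the last slot down to the first, so item-1 ends up on top.
: demo-get-stack ( stack-id -- item-n .. item-1 n )
  dup @ >r  r@ cells +  r@
  begin ?dup while
    1- over @ rot 1 cells - rot
  repeat
  drop r> ;

4 demo-stack: demo-s                 \ room for four items
10 20 30 3 demo-s demo-set-stack     \ 30 is item-1, 10 is item-3
demo-s demo-get-stack .s             \ shows the three items and the count
2drop 2drop
```

GET-STACK returns the items in the same order SET-STACK took them, so a sequence can be read, modified on the data stack, and written back.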
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
' root ' forth ' forth 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
BerndPaysanNew Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE 2drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt performs or compiles the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling, or postponing the thing.
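As an illustration of how i*x travels from a recognizer to its translator, here is a minimal sketch of a recognizer for character literals of the form 'c'. It is only an example, not part of the proposal: the DEMO- names are hypothetical, DEMO-NOTFOUND and DEMO-TRANSLATE-NUM stand in for NOTFOUND and TRANSLATE-NUM, and the translator simply consults STATE.

```forth
: demo-notfound ( -- ) -13 throw ;

\ Translator: leave n in interpretation state, compile it otherwise.
: demo-translate-num ( n -- n | )
  state @ IF postpone literal THEN ;

\ Recognizer: accepts exactly three characters, quote-char-quote.
\ On success i*x is the character code; on failure nothing but the
\ NOTFOUND xt is returned.
: demo-rec-char ( addr u -- char translator-xt | notfound-xt )
  dup 3 = IF
    over c@ [char] ' = IF
      over 2 + c@ [char] ' = IF
        drop char+ c@ ['] demo-translate-num EXIT
      THEN
    THEN
  THEN
  2drop ['] demo-notfound ;
```

In interpretation state, `s" 'a'" demo-rec-char execute` leaves the character code of a on the stack; the recognizer itself never looks at STATE.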
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best-effort approach to display an adequate error message.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state, using a system-specific way to determine the current mode.
TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in interpretation state.
TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in compilation state.
TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in postpone state.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component you can use to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpretation and compilation state into account, and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: translate-num ( n -- )
case state @
-1 of lit, endof
endcase ;
: translate-dnum ( d -- )
\ example of a composite translator using existing translators
>r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: get-forth-recognize ( -- xt )
action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- ) >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- ) >body cell+ @ execute ;
: translate-post ( translate-xt -- ) >body @ execute ;
Defining translators
Once you have TRANSLATE: and the associated invocation tools, you shall define the translators using it:
: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
' root ' forth ' forth 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original one.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt performs or compiles the action of the thing, according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling, or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state, and xt-post in postpone state, using a system-specific way to determine the current mode.
TRANSLATE-INT ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in interpretation state.
TRANSLATE-COMP ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in compilation state.
TRANSLATE-POST ( j*x i*x translator-xt -- k*x ) RECOGNIZER EXT
Translate as in postpone state.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component that you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component that you can use to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpretation and compilation state into account, and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate-nt ( nt -- )
case state @
0 of name>interpret execute endof
-1 of name>compile execute endof
nip \ do nothing if state is unknown; possible error handling goes here
endcase ;
: translate-num ( n -- )
case state @
-1 of lit, endof
endcase ;
: translate-dnum ( d -- )
\ example of a composite translator using existing translators
>r translate-num r> translate-num ;
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: get-forth-recognize ( -- xt )
action-of forth-recognize ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
: translate-int ( translate-xt -- ) >body 2 cells + @ execute ;
: translate-comp ( translate-xt -- ) >body cell+ @ execute ;
: translate-post ( translate-xt -- ) >body @ execute ;
Defining translators
Once you have TRANSLATE: and the associated invocation tools, you shall define the translators using it:
: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
AntonErtl
It seems to me that, given the reference implementation
' translate-nt translate-int
' translate-num translate-int
' translate-dnum translate-int
does not work (nor with translate-comp nor translate-post). Assuming you solve this, do you really want me to define, e.g.,
:noname ['] translate-nt translate-int ;
to get an xt equivalent to one of the xts that has been passed to translate:?
How do you implement POSTPONE (IIRC Matthias Trute has a reference implementation for that)?
What problem is solved by making all the translators state-smart? The problem I see is that you can only access the individual actions by saving state, setting state, executing the translator, and restoring the state. That's not a good design.
The specification of translate: mentions a "current mode". Where do I find out what a "mode" is? This is non-standard terminology.
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original one.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt interprets, compiles, or postpones the action of the thing, according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling, or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state, and xt-post in postpone state, using a system-specific way to determine the current mode.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT
Get the interpreter xt from the translator.
COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT
Get the compiler xt from the translator.
POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT
Get the postpone xt from the translator.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component that you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component that you can use to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpretation and compilation state into account, and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;
Defining translators
Once you have TRANSLATE: and the associated invocation tools, you shall define the translators using it:
: lit, ( n -- ) postpone Literal ;
' noop ' lit, :noname lit, postpone lit, ; translate: translate-num
:noname name>interpret execute ;
:noname name>compile execute ;
:noname lit, postpone name>compile postpone execute ; translate: translate-nt
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem this proposal tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original one.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt interprets, compiles, or postpones the action of the thing, according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling, or postponing the thing.
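The recognizer/translator contract described above can be illustrated with a minimal sketch; it is not part of the proposal. It assumes a system where STATE is zero while interpreting and non-zero while compiling, and it handles only those two states (postponing is left out). The recognizer name rec-hashnum and its #<digits> lexeme syntax are hypothetical, and notfound and translate-num are simplified stand-ins for the proposed words.

```forth
\ Stand-in for the proposed NOTFOUND.
: notfound ( -- ) -13 throw ;

\ Simplified translator for a single-cell number: leave it on the stack
\ when interpreting, compile it as a literal when compiling.
: translate-num ( n -- n | )
  state @ if postpone literal then ;

\ Hypothetical recognizer for lexemes of the form #<digits>, e.g. #42.
\ It never looks at STATE; it only returns data plus a translator xt.
: rec-hashnum ( addr u -- n translate-xt | notfound-xt )
  dup 2 < if 2drop ['] notfound exit then            \ too short
  over c@ [char] # <> if 2drop ['] notfound exit then \ no # prefix
  1 /string                  \ skip the prefix
  0 0 2swap >number nip      \ -- ud u-remaining
  if 2drop ['] notfound exit then  \ unconverted characters left over
  drop ['] translate-num ;   \ keep the low cell of ud as n
```

With these definitions, `s" #42" rec-hashnum execute` performs the translator in the current state; in interpretation state with a decimal BASE it leaves 42 on the stack.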
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state, and xt-post in postpone state, using a system-specific way to determine the current mode.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT
Get the interpreter xt from the translator.
COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT
Get the compiler xt from the translator.
POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT
Get the postpone xt from the translator.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component that you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component that you can use to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpret and compile state into account and uses the STATE variable to distinguish them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you should define the default recognizer as a sequence:
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
BerndPaysan New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows more fancy stuff to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, which executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in. i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current state.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries the recognizers in turn, starting with xtn and proceeding towards xt1, until one succeeds.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT
Get the interpreter xt from the translator.
COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT
Get the compiler xt from the translator.
POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT
Get the postpone xt from the translator.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component you can use to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpret and compile state into account and uses the STATE variable to distinguish them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you should define the default recognizer as a sequence:
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
BerndPaysan New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
- 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows more fancy stuff to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, which executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in. i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current state.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries the recognizers in turn, starting with xtn and proceeding towards xt1, until one succeeds.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
INTERPRET-TRANSLATOR ( translate-xt -- xt-interpret ) RECOGNIZER EXT
Get the interpreter xt from the translator.
COMPILE-TRANSLATOR ( translate-xt -- xt-compile ) RECOGNIZER EXT
Get the compiler xt from the translator.
POSTPONE-TRANSLATOR ( translate-xt -- xt-postpone ) RECOGNIZER EXT
Get the postpone xt from the translator.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token; a system component you can use to construct other translators.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number; a system component you can use to construct other translators.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only interpret and compile state into account and uses the STATE variable to distinguish them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
: interpret-translator ( translate-xt -- xt-interpret ) >body 2 cells + @ ;
: compile-translator ( translate-xt -- xt-compile ) >body 1 cells + @ ;
: postpone-translator ( translate-xt -- xt-postpone ) >body 0 cells + @ ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you should define the default recognizer as a sequence:
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
BerndPaysan
I removed the access words to the xts for a reason: We don't do it that way in Gforth, and we actually found little use of those words.
There are (at least) three ways to create an interface to these operations:
- Field-like, i.e. you can read and write the xts for interpretation/compilation/postpone. The typical usage is ( translator ) TRANSLATE-<state> @ EXECUTE.
- Valuefield-like, as in the original Trute proposal. You can read the xts for interpretation/compilation/postpone without an extra @, but you can't write them, unless your access word also implements TO. The typical usage is ( translator ) TRANSLATE-<state> EXECUTE.
- Deferfield-like, which is what Gforth does. Here, not only the @ is part of the operation, but also the EXECUTE (really a tail-call variant of it). You can neither read nor write the xts, unless the access words also implement IS and ACTION-OF. The typical usage is ( translator ) TRANSLATE-<state>, and that looks about right. Gforth uses different names to not collide with the proposal here.
As an extension, Gforth allows adding more states and thus more access words. That extension also adds IS, TO (which are synonyms) and ACTION-OF to the existing access word (there is only one, because only postpone state needs an explicit one), and also implements the other two for interpret and compile, which are never used on their own. Of course, when you add a new state, you need to specify what existing translators do in that state, so IS becomes necessary, and ACTION-OF just comes for free through Gforth's way of implementing TO and its variants, of which ACTION-OF is one. This extension is non-standard and not proposed here; it is used for creating obscured (“tokenized”) source code and reading name=value-style config files.
The experience so far is that outside of this extension, only one of those three access words is needed at all, which is TRANSLATE-POSTPONE, and it is needed exclusively inside the standard word POSTPONE itself, a word whose implementation is left up to the system anyway. So the usage of these words is extremely limited. Therefore, I deleted them and suggest not standardizing these words, following the “don't speculate” rule and the aim of this proposal to make a minimalistic API that contains only what's necessary. These words are of little use, so there's no need to standardize them.
BerndPaysan New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
- 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
- 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows more fancy stuff to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unsure what to do on postpone at this stage, use -48 throw; otherwise, define a postpone action:
:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF compile, ELSE execute THEN ;
:noname ( xt flag -- ) 0< IF postpone literal postpone compile, ELSE compile, THEN ;
translate: translate-xt
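The same pattern carries over to other kinds of data. As a sketch only, a translator for double-cell numbers could be built the way the reference implementation builds TRANSLATE-NUM. It assumes TRANSLATE:, NOTFOUND and NOOP as used in the reference implementation, plus 2LITERAL and D= style double support from the DOUBLE word set; the names 2lit, translate-double and rec-double are invented for this illustration and are not part of the proposal.

```forth
\ Hypothetical helper: compile a double-cell literal.
: 2lit, ( d -- ) postpone 2literal ;

' noop                                   \ interpret: leave d on the stack
' 2lit,                                  \ compile: compile d as a literal
:noname ( d -- ) 2lit, postpone 2lit, ;  \ postpone: compile code that compiles d
translate: translate-double ( d -- )

\ Matching recognizer (assumption: unsigned digits in the current BASE,
\ followed by a single trailing "."; no sign, no base prefix).
: rec-double ( addr u -- d translate-double | notfound )
  dup 2 < IF 2drop ['] notfound EXIT THEN
  2dup + 1 chars - c@ [char] . <> IF 2drop ['] notfound EXIT THEN
  1-  0. 2swap >number nip               \ ( ud u' ) u' = unconverted characters
  0= IF ['] translate-double ELSE 2drop ['] notfound THEN ;
```

The recognizer itself never looks at STATE; all state-dependent behavior lives in the three xts handed to TRANSLATE:.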
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
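One possible use, given as a sketch and not as part of the proposal: a recognizer for character literals of the form 'c'. It assumes the words of this proposal (TRANSLATE-NUM, NOTFOUND, REC-NT, REC-NUM, RECOGNIZER-SEQUENCE: and SET-FORTH-RECOGNIZE) are available, for instance by loading the reference implementation. The names rec-char and char-recognize are made up for this illustration. Because the recognized value is an ordinary single-cell number, the existing translator TRANSLATE-NUM can be reused and no new translator is needed.

```forth
\ Sketch only: rec-char and char-recognize are hypothetical names.
: rec-char ( addr u -- x translate-num | notfound )
  dup 3 = IF                            \ exactly three characters?
    over c@ [char] ' = IF               \ opening quote
      over 2 chars + c@ [char] ' = IF   \ closing quote
        drop char+ c@                   \ the character between the quotes
        ['] translate-num EXIT
      THEN
    THEN
  THEN
  2drop ['] notfound ;

\ Build a sequence from the two default recognizers plus the new one
\ and install it as the system recognizer.
' rec-char ' rec-num ' rec-nt 3 recognizer-sequence: char-recognize
' char-recognize set-forth-recognize
```

The relative position of rec-char in the sequence matters little here, because neither REC-NT nor REC-NUM normally accepts a string such as 'a'.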
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, which executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in. i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a user-defined translator from scratch.
"name" ( j*x i*x -- k*x ) performs xt-int in interpretation state, xt-comp in compilation state and xt-post in postpone state, using a system-specific way to determine the current state.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value), and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries the recognizers in turn, starting with xtn and proceeding towards xt1, until one succeeds.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
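For example, GET-RECOGNIZER-SEQUENCE and SET-RECOGNIZER-SEQUENCE can be combined to add one recognizer to an existing sequence. This is a sketch: +recognizer is a hypothetical helper name, and it assumes the sequence has room for one more entry (the reference implementation reserves a fixed number of slots and does not check for overflow).

```forth
\ Sketch only: +recognizer is not part of the proposal.
: +recognizer ( xt-rec xt-seq -- )
  dup >r swap >r              \ R: xt-seq xt-rec
  get-recognizer-sequence     \ ( xt1 .. xtn n )
  r> swap 1+                  \ ( xt1 .. xtn xt-rec n+1 )
  r> set-recognizer-sequence ;

\ Usage, with a hypothetical recognizer rec-foo and sequence my-recognize:
\ ' rec-foo ' my-recognize +recognizer
```

Going by the glossary wording of RECOGNIZER-SEQUENCE:, the new entry becomes xtn+1 and is therefore the first one tried.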
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
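The order in which a sequence tries its recognizers (last xt first) can be illustrated with a small self-contained sketch. It does not use the stack library above: `demo-seq`, `try-seq`, `rec-any`, `rec-foo` and the two stand-in translators are hypothetical names invented for this illustration, and the layout (count cell followed by xt1 .. xtn) is an assumption modelled on the reference `recognize`.

```forth
: notfound ( -- ) -13 throw ;

\ Hypothetical stand-in translators; real ones would come from TRANSLATE:
: translate-first  ( -- ) ;
: translate-second ( -- ) ;

\ Two toy recognizers: one accepts everything, one accepts only "foo"
: rec-any ( addr u -- translator-xt ) 2drop ['] translate-first ;
: rec-foo ( addr u -- translator-xt | notfound-xt )
  s" foo" compare 0= if ['] translate-second else ['] notfound then ;

\ Assumed layout: count cell, then xt1 .. xtn
create demo-seq 2 , ' rec-any , ' rec-foo ,

\ Try xtn first, then proceed towards xt1 until one succeeds
: try-seq {: addr u seq | n -- i*x xt :}
  seq @ to n
  begin n while
    addr u  seq n cells + @ execute
    dup ['] notfound <> if exit then
    drop  n 1- to n
  repeat
  ['] notfound ;

s" foo" demo-seq try-seq drop   \ rec-foo is tried first and succeeds
s" bar" demo-seq try-seq drop   \ rec-foo fails, rec-any accepts
```

Using a local for the index keeps the loop state out of the way of whatever data a recognizer leaves under its translator.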
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
AntonErtl
Making the translators depend on state is a bad idea. It means that everything using the translators becomes infected with this state-dependency. It also means that you cannot implement postpone or ]] ...[[ as standard-compliant code (while, with state-independent translators you could).
Moreover, when you write a state-independent text interpreter, such as a polyForth-style text interpreter, or colorforth-bw, you would have to set state before executing the translators, which is perverse. And in the case of colorforth-bw, again there is no standard way to set state to get the translator to perform xt-post.
BerndPaysan
The experience with the usage in Gforth (non-standard extensions excluded) shows that direct calls to translators with a specific state are limited to postpone, which is compile-only, and therefore
: postpone ( "name" -- )
-2 state ! parse-name forth-recognize execute -1 state ! ; immediate compile-only
does not generate surprises (postpone is expected to leave the system in compilation state after it has done its work). In Gforth, ]] and [[ are implemented by changing state, and for recognizing the super-immediate [[ a special recognizer is added to the stack; it returns a translator whose postpone action changes back to compilation state and drops the additional recognizer from the stack.
' noop dup :noname ] forth-recognizer stack> drop ; translate: translate-[[
The state-dependent invocation is the 99.9% case for translators, and that includes ]] and [[.
The Forth outer interpreter depends on state (or a similar internal representation). The object that deals with the different actions depending on state is the translator.
The proposal allows you to implement other ways to access the individual methods of a translator, if you need them. It no longer encourages using translators as building blocks for other translators, and we can add wording that only translators created by translate: are standard-conforming. Since there's little use for these other access methods, it does not suggest standardizing them.
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
- 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
- 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
- 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore defines a very minimalistic API for a core recognizer and leaves the fancier features to extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unsure what to do on postpone at this stage, use -48 throw; otherwise define a postpone action:
:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF compile, ELSE execute THEN ;
:noname ( xt flag -- ) 0< IF postpone literal postpone compile, ELSE compile, THEN ;
translate: translate-xt
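For the case where the postpone action is still undecided, the -48 throw placeholder mentioned above might look like the following sketch. The `translate:` shown here simply copies the STATE-indexed table from the reference implementation so that the block is self-contained; a real system may determine the mode differently.

```forth
\ Assumption: same STATE-indexed dispatch as the reference TRANSLATE:
\ (STATE is 0 interpreting, -1 compiling, -2 postponing)
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,                         \ body holds: post, comp, int
  does> state @ 2 + cells + @ execute ;

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< if compile, else execute then ;
:noname ( xt flag -- ) 2drop -48 throw ;   \ -48: invalid POSTPONE
translate: translate-xt

5 ' dup -1 translate-xt   \ interpretation state: executes DUP
2drop
```

Throwing in the postpone slot keeps the translator well-formed while making an unsupported POSTPONE fail loudly instead of silently miscompiling.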
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
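A minimal sketch of such an addition, assuming a simple depth-based underflow check: the definition of `?stack`, the toy recognizer `rec-n` and its no-op translator `translate-n` are hypothetical and only serve to make the loop runnable on its own.

```forth
decimal   \ the toy recognizer below assumes BASE is decimal
defer forth-recognize ( addr u -- i*x translator-xt | notfound-xt )
: notfound ( -- ) -13 throw ;

\ Assumed stack check: -4 is the throw code for stack underflow
: ?stack ( -- ) depth 0< if -4 throw then ;

: interpret ( i*x -- j*x )
  begin ?stack parse-name dup while forth-recognize execute repeat
  2drop ;

\ Toy recognizer: unsigned single-cell numbers only; translator is a no-op
: translate-n ( n -- n ) ;
: rec-n ( addr u -- n translator-xt | notfound-xt )
  0. 2swap >number nip 0= if drop ['] translate-n
  else 2drop ['] notfound then ;
' rec-n is forth-recognize

\ The system interpreter runs INTERPRET, which then parses the rest of the string
s" interpret 3 4" evaluate   \ leaves 3 4 on the stack
2drop
```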
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
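As an illustration of this stack effect, here is a sketch of a user-defined recognizer for three-character lexemes of the form 'c'. The name `rec-char` is hypothetical; `translate:`, `lit,` and `translate-num` are repeated from the reference implementation (with an explicit no-op in place of `noop`) so that the block stands on its own.

```forth
: notfound ( -- ) -13 throw ;
: lit, ( n -- ) postpone literal ;

\ Assumption: STATE-indexed dispatch as in the reference implementation
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,
  does> state @ 2 + cells + @ execute ;

:noname ( n -- n ) ;                    \ interpret: leave the number
' lit,                                  \ compile: compile it as a literal
:noname ( n -- ) lit, postpone lit, ;   \ postpone
translate: translate-num

\ Recognize 'c' and return the character code plus the number translator
: rec-char ( addr u -- c translator-xt | notfound-xt )
  3 = if
    dup c@ [char] ' =  over 2 chars + c@ [char] ' =  and if
      char+ c@ ['] translate-num exit
    then
  then
  drop ['] notfound ;

s" 'A'" rec-char execute   \ interpretation state: leaves 65
drop
```

Note that the recognizer itself never looks at STATE; all mode-dependent behavior lives in the translator it returns.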
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a translator.
"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.
Rationale: By far the most common usage of translators is inside the outer interpreter, and this default mode of operation is invoked by EXECUTE to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE-independently (e.g. for implementing POSTPONE); these only work on translators created by TRANSLATE:, so any other way to define a translator is non-standard.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as xt1 .. xtn n.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token.
TRANSLATE-NUM ( j*x x -- k*x ) RECOGNIZER EXT
Translates a number.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only the interpret and compile states into account, and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Testing
TBD
ruv
Anton writes:
Making the translators depend on state is a bad idea.
It's a subject of terminology. A translator depends on state by definition:
to translate a token: to interpret the token if interpreting, or to compile the token if compiling.
token translator: a Forth definition that translates a token; also, depending on context, the execution token for this Forth definition.
If you want a recognizer to return not an execution token, but some opaque identifier, I suggest calling it a "descriptor".
token descriptor object: an implementation dependent data object (a set of information) that describes how to interpret and how to compile a token.
token descriptor: a value that identifies a token descriptor object; also, less formally and depending on context, a Forth definition that just returns this value (i.e., a constant), or a token descriptor object itself.
BerndPaysan
As said, this is just about moving things around. There's little difference whether you use translator-execute or execute on a translator as the specified way to get from the translator to its state-dependent action. It's all system-dependent and hidden: a system might implement it without even referring to STATE, only updating STATE to reflect compilation and interpretation state and otherwise never looking at it, and the way the system internally keeps its state can be completely different. The most obvious difference is that with translator-execute, you need another word.
The fact that some abstract data type is executable does not mean EXECUTE is the only way to operate on it. Recognizer sequences are executable in this proposal, and they can still be read out and set with GET/SET-RECOGNIZER-SEQUENCE. So you can't just define them as colon definitions; you need to go through RECOGNIZER-SEQUENCE: to define them.
Though I don't propose to standardize this, the proposal also suggests making word list ids executable, and putting them together in a recognizer sequence called search-order. Word list ids still have to be usable in other ways, e.g. to add new words to them, and the details are left to the system; but it is clear that they can't be normal colon definitions.
Not providing an abstraction like either translator-execute or execute, and instead putting it directly as state @ abs cells + @ execute into the outer interpreter, is a really bad idea, because all details of the reference implementation in which this sequence works then become part of the standard. Other ways to implement it, which may have performance advantages or avoid exposing the postpone state in STATE, would then not be allowed. The following implementation should be standard, too:
: do-translate ( translate-body -- ) 0 + @ execute ;
: state! ( state -- ) dup state ! abs cells ['] do-translate >body cell+ ! ; \ assume threaded code
: translate: ( int-xt comp-xt post-xt "name" -- )
create swap rot , , ,
does> do-translate ;
: [ 0 state! ; immediate
: ] -1 state! ;
: ]] 2 cells ['] do-translate >body cell+ ! ; immediate \ STATE left as is
How to recognize [[ is left as an exercise to the reader. Hint: a recognizer is a good idea, because it actually provides something that is executed at postpone time.
AntonErtl
By comparison, with the first version of this proposal postpone can be implemented like this:
: postpone parse-name forth-recognize -2 swap execute ; immediate
which would not contain non-standard usage like -2 state !, and it would also work in interpret state (not the most important feature, but a feature nonetheless). And ]] could also be implemented as a standard program.
I don't want to restrict the usage of rectypes/translators to state-dependent outer interpreters. Other uses may be rare, but they exist, and people may come up with more over time if we make the interface flexible enough. The proposal does not propose to standardize state-independent ways to get at the functionality. Therefore, if the proposal is accepted, they don't exist for standard programs, and therefore they are not counterarguments against the disadvantages of the proposed state-dependent-only translators. The fact that this state-dependence means that you cannot use rectypes/translators to build other rectypes/translators is another (minor) argument against the state-dependence.
Concerning having a state-independent rectype as an abstract data type, the first version of this proposal proposed that a rectype is an executable word with stack effect ( i*x state -- j*x ) where state would be 0, -1, or -2. This does not expose anything about the internals, and even allows defining rectypes without using a special defining word. The invocation in the text interpreter is ( i*x rectype ) state @ swap execute, and in postpone it's as shown above.
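A sketch of what such a state-taking rectype could look like as an ordinary colon definition, for the number case. The name `rectype-num` is hypothetical, and the three branches mirror the actions of translate-num in the reference implementation; this illustrates the interface described here, not code from the first version of the proposal.

```forth
: lit, ( n -- ) postpone literal ;

\ The mode is passed explicitly as 0, -1 or -2; STATE is never read
: rectype-num ( n state -- n | )
  case
     0 of endof                      \ interpret: leave n on the stack
    -1 of lit, endof                 \ compile: compile n as a literal
    -2 of lit, postpone lit, endof   \ postpone: compile code that compiles n
  endcase ;

42 0 rectype-num drop                  \ interpret action, leaves 42

\ The compile action can be requested even while STATE is 0:
: t1 ( -- n ) [ 42 -1 rectype-num ] ;  \ t1 pushes 42
```

Because the mode is an argument, the compile and postpone actions can be invoked from any context in which a definition is being built, regardless of what STATE currently holds.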
Alternatively, if the rectype is the address of some data structure, yes, we would need an additional word, maybe rectype-translate ( i*x rectype n -- j*x ) that performs the access to the data structure. The usage in the text interpreter would be ( i*x rectype ) state @ rectype-translate and the usage in postpone would be ( i*x rectype ) -2 rectype-translate.
ruv
Bernd writes:
The most obvious difference is that with translator-execute, you need another word.
Yes, essentially I agree concerning the translator-execute and execute alternatives.
Yet another difference is that with translator-execute the Forth text interpreter (the outer loop) must know this additional word (which probably means a higher degree of coupling), whereas with execute it need not know any additional word.
to make word list ids executable [...] but it is clear that they can't be normal colon definitions
Another example is defer-words (words created by defer), which are executable but are not normal colon definitions: defer! and defer@ can be applied to their xt.
The following implementation should be standard, too
The provided implementation for ]] is system dependent, namely it depends on the implementation of the Recognizers API.
But, anyway, Gforth's ]] can be implemented in a standard way via postpone.
BerndPaysan
A translator is the address of a data structure, which also happens to be executable. This is not a contradiction! And there was a proposed standard way to access fields directly, renamed from the Trute proposal (but with otherwise identical, value-field like semantics) to INTERPRET-TRANSLATOR, COMPILE-TRANSLATOR, and POSTPONE-TRANSLATOR. The reason I deleted these is that we don't even use them in Gforth; we only use >POSTPONE, which has a different effect (it does not read out the xt, it executes it right away). If there is consensus that this is the right interface (not a value-field, but a defer-field), I can add this back to the proposal, as well as a standard way to set the state without knowing the internals of the system, for which the file recognizer-ext.fs in Gforth also provides a suggestion:
: translate-state ( translator-access-xt -- )
\ takes a translator access xt, and may check if that actually is one
>body @ cell/ negate state ! ;
The hypothetical more performant implementation in Reply 1043 would have a different translate-state, which would contain something like
>body @ ['] do-translate >body cell+ !
and only change STATE for interpret/compile.
This proposal is minimalistic on purpose and does not cover all corner cases, especially not those where no consensus has been reached yet.
I consider the magic number dispatch method proposed earlier as not appropriate: this is tied to a specific implementation, and not a good interface. Method invocation or field access should be done by named access words, not by numbers.
ruv
Anton writes:
postpone can be implemented like this
postpone can be implemented in any variant of the Recognizer API, with more or less code.
A difference is whether the behavior of postpone can be extended/changed without redefinition of postpone.
My point: if users need to extend the behavior of postpone without redefinition, then a special method can be specified for that. OTOH, postpone (and ]]) is a poor man's "postponing mode". An example of a more convenient tool is my c-state PoC, which provides a better tool for users, and it even supports any new user-defined special words.
I don't want to restrict the usage of rectypes/translators to state-dependent outer interpreters.
It's not an argument, since the API can provide words like compile-token, execute-token, postpone-token, having ( i*x xt.translator -- j*x ) or ( i*x rectype -- j*x ), which are state-independent and don't restrict usage in the mentioned way.
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
- 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
- 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
- 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
- 2023-09-15 Add list of example recognizers and their names.
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore defines a very minimalistic API for a core recognizer and leaves the fancier features to extensions. The problem it tries to solve is the same as that of the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original proposal.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unsure what to do on postpone at this stage, use -48 throw; otherwise define a postpone action:
:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF compile, ELSE execute THEN ;
:noname ( xt flag -- ) 0< IF postpone literal postpone compile, ELSE compile, THEN ;
translate: translate-xt
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer; j*x and k*x are the stack inputs and outputs of interpreting, compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a translator.
"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.
Rationale: By far the most common usage of translators is inside the outer interpreter, and this default mode of operation is invoked by EXECUTE to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE-independently (e.g. for implementing POSTPONE); these only work on translators created by TRANSLATE:, so any other way to define a translator is non-standard.
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to obtain the current behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( n*xt n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with the topmost xt on stack and proceeding towards the bottommost xt until successful.
SET-RECOGNIZER-SEQUENCE ( n*xt n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to n*xt.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- n*xt n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as n*xt n.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token.
TRANSLATE-NUM ( n -- n | ) RECOGNIZER EXT
Translates a number.
TRANSLATE-DNUM ( d -- d | ) RECOGNIZER EXT
Translates a double number.
TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT
Translates a floating point number.
TRANSLATE-STRING ( addr u -- addr u | ) RECOGNIZER EXT
Translates a string.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system that handles only words and single-cell numbers without base prefix. It takes only the interpret and compile states into account, and uses the STATE variable to distinguish between them.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT ;
: lit, ( n -- ) postpone literal ;
: notfound ( -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Recognizer examples
REC-NT ( addr u -- nt translate-nt | notfound ) Search the locals wordlist if locals have been defined, and then the search order for a definition matching the string addr u, and provide that name token as result.
REC-NUM ( addr u -- n translate-num | d translate-dnum | notfound ) Try converting addr u into a number, and on success return either a single number n and translate-num, or a double number d and translate-dnum.
REC-FLOAT ( addr u -- r translate-float | notfound ) Try converting addr u into a floating point number, and on success return that number r and translate-float.
REC-STRING ( addr u "string"<"> -- addrs us translate-string | notfound "string"<"> ) Convert quoted strings (i.e. addr u starts with '"') in the input stream into string literals, performing the same escape handling as S\" and on success return the converted string as addrs us and translate-string.
REC-TICK ( addr u -- xt translate-num | notfound ) If addr u starts with a ` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.
REC-SCOPE ( addr u -- nt translate-nt | notfound ) Search for words in specified vocabularies (the vocabulary needs to be found in the current search order). The string addr u has the form vocabulary:name. Apart from specifying the vocabulary to be searched, REC-SCOPE is identical in effect to REC-NT.
REC-TO ( addr u -- xt n translate-to | notfound ) Handle the following syntax of TO-like operations of value-like words:
* ->value as TO value or IS value
* +>value as +TO value
* '>value as ADDR value
* @>value as ACTION-OF value
xt is the execution token of the value found, n indexes which variant of a TO-like operation is meant, and translate-to is the corresponding translator.
REC-ENV ( addr u -- addrs us translate-env | notfound ) Takes a pattern in the form of ${name} and provides the name as addrs us on the stack. The corresponding translator translate-env is responsible for looking up that name in the operating system's environment variable array.
REC-COMPLEX ( addr u -- rr ri translate-complex | notfound ) Converts a pair of floating point numbers in the form of float1+float2i into a complex number on the stack, and returns translate-complex on success.
Testing
TBD
BerndPaysan
New Version: minimalistic core API for recognizers
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
- 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
- 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
- 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
- 2023-09-15 Add list of example recognizers and their names.
Problem:
The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to create a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. It addresses the same problem as the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original one.
Solution:
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Important changes to the original proposal:
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: rec-xt ( addr u -- translator )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] drop
ELSE drop ['] notfound THEN ;
then you should factor the part starting with state @ out and return it as translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: rec-xt ( addr u -- ... translator )
here place here find dup IF ['] translate-xt
ELSE drop ['] notfound THEN ;
In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you are unclear about what to do on postpone at this stage, use -48 throw; otherwise define a postpone action:
:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF compile, ELSE execute THEN ;
:noname ( xt flag -- ) 0< IF postpone literal postpone compile, ELSE compile, THEN ;
translate: translate-xt
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
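As a minimal sketch of this loop in action, the following self-contained fragment plugs a single toy number recognizer into a deferred FORTH-RECOGNIZE. The name toy-interpret is hypothetical, and translate-num is defined here as a plain interpret-only colon word purely for illustration; a conforming translator would be defined with TRANSLATE:.

```forth
: notfound ( -- ) -13 throw ;

\ Interpret-only stand-in for a translator (assumption for this sketch):
\ the number simply stays on the stack.
: translate-num ( n -- n ) ;

\ Recognize an unsigned number in the current BASE.
: rec-num ( addr u -- n translate-num | notfound )
  0 0 2swap >number nip          ( ud u2 )
  if   2drop ['] notfound        \ unconverted characters remain
  else drop  ['] translate-num   \ keep the low cell as a single number
  then ;

defer forth-recognize ( addr u -- i*x translator-xt | notfound-xt )
' rec-num is forth-recognize

\ The loop shown above, under a hypothetical name.
: toy-interpret ( i*x -- j*x )
  begin parse-name dup while forth-recognize execute repeat 2drop ;
```

Because toy-interpret parses from the current input source, it can be tried with EVALUATE: in s" toy-interpret 1 2 3" evaluate the system interpreter only handles the first lexeme, and the toy loop consumes the rest of the string. A lexeme that is not a number ends in -13 THROW via NOTFOUND, without any explicit error check in the loop.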
Typical use
TBD
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND):
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
XY.3 Additional usage requirements
XY.3.1 Translator
translator: subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
A translator xt interprets, compiles or postpones the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the thing.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | NOTFOUND-xt ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
Performs -13 THROW. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Create a translator word under the name "name". This word is the only standard way to define a translator.
"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.
Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE-independently, which only works on translators created by TRANSLATE: (e.g. for implementing POSTPONE), so any other way to define a translator is non-standard.
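A minimal sketch of how such a table-based translator could dispatch, assuming that only interpretation and compilation state are distinguished through STATE (the postpone entry is stored, but would be reached only by a system-specific POSTPONE). This variant merely tests STATE for zero or non-zero instead of computing a table index from its value. The names my-translate-num, lit42 and forty-two are hypothetical.

```forth
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,                 \ body layout: xt-post | xt-comp | xt-int
  does> ( i*x body -- j*x )
    state @ if cell+ else 2 cells + then @ execute ;

: lit, ( n -- ) postpone literal ;

:noname ;                      \ interpret: leave n on the stack
' lit,                         \ compile: compile n as a literal
:noname lit, postpone lit, ;   \ postpone: compile code that compiles n
translate: my-translate-num ( n -- n | )

\ An immediate word runs during compilation, so the translator sees
\ compilation state and compiles 42 into the enclosing definition.
: lit42 ( -- ) 42 my-translate-num ; immediate
: forty-two ( -- n ) lit42 ;
```

Executed at the command line, 7 my-translate-num just leaves 7 on the stack, while forty-two returns the 42 that the translator compiled as a literal.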
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Assign the recognizer xt to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
RECOGNIZER-SEQUENCE: ( n*xt n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with the topmost xt on stack and proceeding towards the bottommost xt until successful.
SET-RECOGNIZER-SEQUENCE ( n*xt n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- n*xt n ) RECOGNIZER EXT
Obtain the recognizer sequence xt-seq as n*xt n.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token.
TRANSLATE-NUM ( n -- n | ) RECOGNIZER EXT
Translates a number.
TRANSLATE-DNUM ( d -- d | ) RECOGNIZER EXT
Translates a double number.
TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT
Translates a floating point number.
TRANSLATE-STRING ( addr u -- addr u | ) RECOGNIZER EXT
Translates a string.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish.
Defer forth-recognize ( addr u -- i*x translator-xt / notfound )
: interpret ( i*x -- j*x )
BEGIN
?stack parse-name dup WHILE
forth-recognize execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
translate: translate-nt ( nt -- )
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator / notfound )
forth-wordlist find-name-in dup IF ['] translate-nt ELSE drop ['] notfound THEN ;
: rec-num ( addr u -- n num-translator / notfound )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop ['] notfound THEN ;
: minimal-recognize ( addr u -- nt nt-translator / n num-translator / notfound )
2>r 2r@ rec-nt dup ['] notfound = IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | NOTFOUND )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP ['] NOTFOUND <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP ['] NOTFOUND
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- ) ?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n ) ?defer@ >body get-stack ;
Once you have recognizer sequences, you shall define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute ['] notfound = IF 0 THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Recognizer examples
REC-NT ( addr u -- nt translate-nt | notfound ) Search the locals wordlist if locals have been defined, and then the search order for a definition matching the string addr u, and provide that name token as result.
REC-NUM ( addr u -- n translate-num | d translate-dnum | notfound ) Try converting addr u into a number, and on success return either a single number n and translate-num, or a double number d and translate-dnum.
REC-FLOAT ( addr u -- r translate-float | notfound ) Try converting addr u into a floating point number, and on success return that number r and translate-float.
REC-STRING ( addr u "string"<"> -- addrs us translate-string | notfound "string"<"> ) Convert quoted strings (i.e. addr u starts with '"') in the input stream into string literals, performing the same escape handling as S\" and on success return the converted string as addrs us and translate-string.
REC-TICK ( addr u -- xt translate-num | notfound ) If addr u starts with a ` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.
REC-SCOPE ( addr u -- nt translate-nt | notfound ) Search for words in specified vocabularies (the vocabulary needs to be found in the current search order). The string addr u has the form vocabulary:name. Apart from specifying the vocabulary to be searched, REC-SCOPE is identical in effect to REC-NT.
REC-TO ( addr u -- xt n translate-to | notfound ) Handle the following syntax of TO-like operations of value-like words:
* ``->``*value* as ``TO ``*value* or ``IS ``*value*
* ``+>``*value* as ``+TO ``*value*
* ``'>``*value* as ``ADDR ``*value*
* ``@>``*value* as ``ACTION-OF ``*value*
xt is the execution token of the value found, n indexes which variant of a TO-like operation is meant, and translate-to is the corresponding translator.
REC-ENV ( addr u -- addrs us translate-env | notfound ) Takes a pattern in the form of ${name} and provides the name as addrs us on the stack. The corresponding translator translate-env is responsible for looking up that name in the operating system's environment variable array.
REC-COMPLEX ( addr u -- rr ri translate-complex | notfound ) Converts a pair of floating point numbers in the form of float1+float2i into a complex number on the stack, and returns translate-complex on success.
Testing
TBD
BerndPaysan
Things to discuss, because there are still too many variables.
ToDo:
- Rename Recognizers from REC-result to RECOGNIZE-result. A solution for .RECOGNIZERS drowning the reader in recognize- could be to skip that prefix, because all recognizers are supposed to have the same prefix anyway.
- Revert the name of translators to rectypes or some similar word showing that this does describe a type?
- Add mode/state-specific access words to the translators again and decide on how they work. I prefer defer-field-like words, which execute the corresponding action right away instead of putting an xt on the stack for consumption. Defer-fields could work together with IS and ACTION-OF to access the xts within (in Gforth, they do).
Answers to some questions:
A lot of thought went into making different subsets of this proposal useful on their own, and into allowing different implementation strategies. The answer to “can I do without feature X” is most likely yes. You can use the subset of the features you want. Stripping away too much results in a subset that is no longer usable.
- Opening up the whole idea to small systems is useful to gain wider use. FORTH-RECOGNIZE is a deferred word in the reference implementation on purpose, and that allows changing it without adding more words. To add more implementation options, you can use the setter and getter words (which are optional) if you don't want to implement it as deferred word to swap in and out named sequences.
- The recognizer sequences do have words to get and set the sequence, so you can just work with a single sequence and set/get it if you like. The nesting capability comes by the magical fact that a recognizer sequence has the same stack effect as a recognizer.
- You can do without both, because recognizer sequences can be written as colon definitions “by foot”.
- Named sequences are useful, especially when you swap in recognizer sequences for applications that do something completely different than the Forth recognizer sequence. If you do not want to support named sequences, you can still provide the one single named sequence FORTH-RECOGNIZE, and allow SET-RECOGNIZER-SEQUENCE and GET-RECOGNIZER-SEQUENCE to operate just on that. That's also an option where recognizers are useful without FORTH-RECOGNIZE being deferred and without RECOGNIZER-SEQUENCE:.
- The NOTFOUND return for failure is there so that you can always EXECUTE the result of FORTH-RECOGNIZE and don't have to check for errors there.
Tough question: The string recognizer has a side effect, which is not good. Moving that side effect to the translator is causing other problems, because TRANSLATE-STRING no longer has the corresponding string on the stack, but needs parsing it later. Actually, parsing should happen in PARSE-NAME. It still seems to be a hack that doesn't have a perfect solution.
ruv
Rename Recognizers from REC-result to RECOGNIZE-result
In general, an abbreviation or acronym may be acceptable to me. But in this case I prefer RECOGNIZE- rather than REC-. The main disadvantage of rec is that it has misleading associations. And the main advantage of recognize is that it's a whole English word that is very appropriate for our case.
The part referred to as "result" should not be a result (of recognizing), but the expected type of the input lexeme. Have a look at your examples: REC-NUM and REC-TICK produce the same result type translate-num, but they accept different types of input lexemes, and these types are identified by the NUM and TICK symbols respectively.
Thus, the naming form for recognizers can be expressed as RECOGNIZE-{lexeme-type-symbol}.
Revert the name of translators to rectypes or some similar word showing that this does describe a type?
It does describe a type of what? It describes the type of a token i*x, which is a result of recognizing. Actually, a token translator identifies the type of a token i*x, so a token translator is a token type at the same time.
If we want to reflect this idea, we can use the acronym tt, which stands for both: token translator and token type. Then, token translators can be named according to the form TT-{token-type-symbol}. It looks elegant to me.
The names of translators are used for two purposes: to call a translator (for example, when we define a new translator via existing translators), and to obtain the xt of a translator (which is at the same time an identifier for a token type) in order to analyze a result of recognizing. The prefix tt- looks good in both cases.
ruv
An example of use translators for two different purposes:
\ use "tt-lit" and "tt-2lit" just to call these token translators:
: tt-3lit ( 3*x -- 3*x | )
>r tt-2lit r> tt-lit
;
: recognize-forth-lexeme ( sd -- i*x tt ) forth-recognizer execute ;
\ use "tt-xt" to analyze a token type:
: recognize-tick ( sd -- xt tt.xt | 0 )
"'" match-head 0= if 2drop 0 exit then ( sd2 ) \ the input lexeme without the leading tick
['] recognize-forth-lexeme execute-balance2 ( i*x tt|0 n.data-stack n.float-stack )
2>r dup ['] tt-xt = if 2rdrop exit then drop 2r> fndrop ndrop 0
;
In this implementation of recognize-tick (not tested), the phrase 'foo::bar::baz will work correctly and return the xt of the word baz in the wordlist bar in the wordlist foo, when recognize-pqname for the syntax "::" (example) is a part of forth-recognizer.
To implement this, we do a nested call of the Forth recognizer for another lexeme and then analyze the returned type. If the returned type is not appropriate, we drop the token (from the data stack, and from the floating-point stack, if any). So we need to be sure that calling recognize-forth-lexeme never causes any side effect (other than on the stacks), even when recognizing succeeds.
NB: when recognize-tick is a part of the current forth-recognizer, executing recognize-forth-lexeme on some inputs will produce an indirectly recursive call of recognize-forth-lexeme (as intended).
ruv
Tough question: The string recognizer has a side effect, which is not good. Moving that side effect to the translator is causing other problems, because TRANSLATE-STRING no longer has the corresponding string on the stack, but needs parsing it later.
It's perfectly acceptable for a translator to parse the input buffer and/or read the input stream. Some token translators will even make nested calls of the Forth text interpreter and can throw exceptions.
The problem that a part of the string may still be in the input buffer (or even in the input stream) is solved by introducing two translators for strings: one accepts the full string from the stack (e.g. tt-slit), and another (e.g. tt-slit-parsing) accepts the starting part from the stack, and the tail from the input buffer (or input stream). The string recognizer returns one or the other depending on whether a lexeme is a completed string or only the start of a string.
I published a reference implementation in 2019, and now updated it for the current proposal.
A string recognizer can be as follows:
: quot ( -- sd.quot ) s\" \"" ;
: recognize-string ( sd.lexeme -- sd tt.slit|tt.slit-parsing | 0 )
quot match-head 0= if 2drop 0 exit then quot match-tail if ['] tt-slit exit then
2dup quot contains if 2drop 0 exit then \ fail if '"' is found in the middle of the string
['] tt-slit-parsing
;
BerndPaysan
The code which I have simply looks like this:
['] translate-string of json-string! endof
Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.
Thinking a bit more about that, I found:
- The thing you want to nest is the translator for names
- As I said, names should be first, numbers second and the rest third
- We have nestable recognizer sequences
So one solution would be to put all recognizers that return nts+translate-nt (or variants of that, e.g. locals have a variant of translate-nt that differs for postpone) in one recognizer stack, which has a name, and can be called without calling the entire recognizer stack. These recognizers now have a predictable effect, and no side effect. Since you can't tick locals, you still have to check for translate-nt, but that's ok. You don't have to go through all the weird other recognizers.
In Gforth, .recognizers now can handle and display nested recognizers, and if you split this up like that, it would output:
.recognizers ~names ( ~nt ( Forth Forth Root ) ~scope ) ~numbers ( ~num ~float ) ~others ( ~string ~to ~dtick ~tick ~body ~complex ~env ~meta )
The ~ is there to abbreviate recognize- (or rec- now).
This also makes it easier to add recognizers where they belong, e.g. when you add the scope recognizer, you just push it to the end of the names recognizer stack. If you add the floating point recognizer, the complex recognizer (both are numbers), or the hex floating point recognizer for exact notation of floating point constants, you just push them to the back of the numbers recognizer stack, and they get ahead of the others. I like this solution.
The other solution is what Gforth does: There's a ?REC-NT which does the nesting, the checking for translate-nt, and the cleaning up of the side effects (stacks and >IN). There is the possibility to make this more generic, e.g. create a word TRY-RECOGNIZE which gets an xt, applies it to the result, and if that returns false, everything is cleaned up and false is returned, otherwise whatever that xt left (including the flag) is returned.
The cleaning up is already cumbersome, because a variable number of values can be returned on both data and floating point stack, and when in addition to that also >IN can change, it's just a little bit more hassle.
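The data-stack part of such a cleanup can be sketched as below. This is a simplified illustration, not Gforth's ?REC-NT or the proposed TRY-RECOGNIZE: it filters for one wanted translator instead of taking a filter xt, and it ignores the floating point stack and >IN entirely. The names execute-balance, ndrop and recognize-only are hypothetical, and the translator is an interpret-only stand-in.

```forth
: notfound ( -- ) -13 throw ;
: translate-num ( n -- n ) ;   \ interpret-only stand-in for a translator

: rec-num ( addr u -- n translate-num | notfound )
  0 0 2swap >number nip
  if 2drop ['] notfound else drop ['] translate-num then ;

\ Execute xt and report by how many cells the data stack depth changed.
: execute-balance ( i*x xt -- j*x n )
  depth 1- >r execute depth r> - ;

: ndrop ( x_n .. x_1 n -- ) 0 ?do drop loop ;

\ Run recognizer xt-rec on addr u; keep the result only if it
\ returned the translator xt-wanted, otherwise clean up and return 0.
: recognize-only ( addr u xt-rec xt-wanted -- i*x xt-wanted | 0 )
  >r execute-balance             ( i*x translator n )
  over r> = if drop exit then    \ wanted translator: keep the token
  2 + ndrop 0 ;                  \ n+2 cells = token plus translator
```

The recognizer consumes two cells (addr u) and leaves the token plus one translator xt, so the number of cells to discard on a mismatch is the measured balance plus two. With rec-num this is one cell for NOTFOUND and two cells for a number with translate-num.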
ruv
One correction. I wrote:
If we want to reflect this idea, we can use the acronym tt, which stands for both: token translator and token type.
It should be read as:
If we want to reflect this idea, we can use the acronym tt, which stands for both: "translate token" (verb) and "token type" (noun).
Data type symbol
To specify formal requirements, we have to introduce a new data type for token translators, which is a subtype of xt. And the abbreviation tt is a good candidate for this data type symbol.
If we have the data type tt => xt|0, and the symbol sd for the string data type, the naming convention along with the stack diagram for a recognizer can be expressed as:
RECOGNIZE-{lexeme-type-symbol} ( sd.lexeme -- i*x tt ) ( F: -- j*r )
ruv
@BerndPaysan writes:
Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.
This would be a valid argument if it were possible to return something useful from a recognizer in all use cases but single-line string literals. But it's impossible.
For example, a recognizer for multi-line string literals cannot parse the full string literal without refilling the source (see my PoC implementation). Should we also restore the input source state to isolate side effects of recognizers?
And it still isn't enough. A recognizer for curly-based markup like foo{ any forth code bar{ nested code }bar ... }foo cannot return something useful, since a useful thing in this case is a created definition or just a side effect of appending some semantics to the current definition. Should we also restore the state of the dictionary?
I think, it's obvious — isolation of all possible side effects of recognizers is not fruitful.
Yes, some recognizers return objects that are not useful by themselves, but they still return information about what a given lexeme means, and that is an acceptable price for the absence of side effects in all recognizers.
Also we separate concerns into things that do have side effects (token translators) and things that don't have side effects (recognizers). It's very useful separation.
ruv
@BerndPaysan writes:
The code which I have simply looks like this:
['] translate-string of json-string! endof
A straightforward solution is to handle each token type of string literals separately. Probably, I would write it as follows:
'tt-slit of json-string! endof
'tt-slit-parsing of parse-slit-end json-string! endof
'tt-slit-ml of parse-slit-ml json-string! endof
(I would use a recognizer for a leading tick, and naming of translators in the form tt-{token-type-symbol})
Or I would factor a helper word as follows:
: ?prepare-tt-slit ( i*x tt -- i*x tt | sd.transient tt.slit )
case
'tt-slit of 'tt-slit endof
'tt-slit-parsing of parse-slit-end 'tt-slit endof
'tt-slit-ml of parse-slit-ml 'tt-slit endof
dup endcase \ no match: keep tt (endcase drops the copy)
;
: eval-json ( .. tag -- )
?prepare-tt-slit case
...
'tt-slit of json-string! endof
...
endcase
;
ruv
Multiple entry points for the Forth recognizer
@BerndPaysan writes:
This also makes it easier to add recognizers where they belong, e.g. when you add the scope recognizer, you just push it to the end of the names recognizer stack. If you add the floating point recognizer, the complex recognizer (both are numbers), or the hex floating point recognizer for exact notation of floating point constants, you just push them to the back of the numbers recognizer stack, and they get ahead of the others. I like this solution.
Yes, I also consider such a solution. It's a convenient solution to implement the default Forth recognizer.
But requiring the Forth recognizer to always conform to this particular structure of recognizer sequences, and even to always be the same instance of this structure, is too restrictive.
Otherwise you don't know the id of the actual names recognizer sequence (and don't even know whether such a sequence exists), and so you cannot check a lexeme against only this sequence (I mean, in the implementation of recognize-tick).
Filter recognizer results
Bernd, your word try-recognize is a good factor to filter results, regardless of side effects (beyond stacks). With recognizers that have no side effects, it can also be implemented in a portable way.
If this word filters for a single token type, it's better to pass a corresponding tt directly (instead of xt.filter).
If this word allows filtering for multiple token types (I assume this variant), it should not drop tt from the stack.
Also, to be more useful, this word should not be bound to the current Forth recognizer only. Then, this word can be called apply-recognizer-filter ( sd.lexeme xt.recognizer xt.filter -- i*x tt | 0 ).
A usage example:
: recognize-forth-name ( sd.lexeme -- nt tt.nt | 0 )
forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter
;
: find-forth-name ( sd.lexeme -- nt | 0 )
forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter if exit then 0
;
: find-forth-name? ( sd.lexeme -- nt true | false )
forth-recognizer [: dup 'tt-nt = ;] apply-recognizer-filter 0<>
;
: recognize-tick ( sd.lexeme -- xt tt.xt | 0 )
"'" match-head 0= if 2drop 0 exit then ( sd2 ) \ the input lexeme without the leading tick
forth-recognizer [: dup 'tt-xt = ;] apply-recognizer-filter
;
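A minimal sketch of how apply-recognizer-filter might be implemented in standard Forth, under the assumptions stated above: recognizers have no side effects beyond the stacks, and (a further simplification made here) tokens occupy the data stack only, so the floating-point stack is ignored. The names rec-len, tt-len, tt-other, len? and other? are hypothetical and exist only to exercise the word.

```forth
\ Sketch only: assumes tokens live on the data stack alone
\ and that the recognizer has no other side effects.
: apply-recognizer-filter ( sd.lexeme xt.recognizer xt.filter -- i*x tt | 0 )
  >r >r                      \ R: xt.filter xt.recognizer
  depth 2 -                  \ depth0: stack depth below the lexeme
  r> swap >r                 \ S: sd.lexeme xt.recognizer  R: xt.filter depth0
  execute                    \ S: i*x tt | 0
  dup 0= if r> drop r> drop exit then
  r> r> swap >r              \ S: i*x tt xt.filter  R: depth0
  execute                    \ S: i*x tt flag
  if r> drop exit then       \ accepted: keep the token as it is
  begin depth r@ > while drop repeat   \ rejected: discard the whole token
  r> drop 0 ;

\ Hypothetical demo recognizer: "recognizes" any non-empty lexeme
\ and returns its length as the unqualified token.
: tt-len ( u -- u ) ;
: tt-other ( -- ) ;
: rec-len ( addr u -- u tt | 0 ) nip dup if ['] tt-len then ;

\ Hypothetical filters: they keep tt on the stack and add a flag.
: len?   ( tt -- tt flag ) dup ['] tt-len = ;
: other? ( tt -- tt flag ) dup ['] tt-other = ;
```

With these definitions, `s" abc" ' rec-len ' len? apply-recognizer-filter` leaves the token and its tt, while the same call with `' other?` discards the token and leaves 0. Named words are used as filters instead of quotations only to keep the sketch within Forth-2012.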
The cleaning up is already cumbersome, because a variable number of values can be returned on both data and floating point stack, and when in addition to that also >IN can change, it's just a little bit more hassle.
Yes, but, as I have shown, >in is not enough. Also, it's better to avoid such special cases in general.
ruv
FORTH-RECOGNIZER ( -- xt )
RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale:
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER, too.
There is no way for a program to check whether it can apply TO
to FORTH-RECOGNIZER
, or FORTH-RECOGNIZE
, or RECOGNIZE-FORTH-LEXEME
, etc.
Thus, TO cannot be optional. And it cannot be mandatory either. Thus, TO cannot be a part of the API at all, neither in RECOGNIZER nor in RECOGNIZER EXT.
Then the getter and setter should be a mandatory part of the API.
ruv
In continuation to the message:
Returning something half-done isn't a good idea and makes maintaining this code difficult, as all other possible results (like ints, floats or such) are fully converted into something useful at this stage.
This would be a valid argument if it were possible to return something useful from a recognizer in all use cases.
Another example of an unuseful token is the result of the mentioned recognizer REC-TO
, which recognizes a syntax like ->foo
.
It's too restrictive to require this token be ( xt n tt )
, since in some systems it can be just ( xt.set-value tt )
, and in others ( addr.data-field xt.store tt )
.
This means that this token is not something useful to a program at all (apart from translation).
GeraldWodni
The committee thanks the authors for all the work. Here is the timetable:
- Everybody interested in this proposal: please submit your comments by end of October.
- Bernd (main author): please work this into a new version by the end of the year (2024).
- The committee will have a special interim meeting for this very proposal in February (final date will be announced in mattermost)
BerndPaysan
Concerning the setters and getters: I would prefer to make it mandatory that FORTH-RECOGNIZE
actually is a deferred word, and drop the additional getters and setters completely. DEFER
, IS
, and ACTION-OF
are all CORE EXT; so if you implement the recognizers, you have a dependency on those. The previous proposals had VALUE
and TO
as the interface, which are also CORE EXT.
Gforth could support IS
and ACTION-OF
on recognizer sequences, too (i.e. assign n elements in order), through its polymorphous approach at all those words for value-style words (TO
, +TO
, ADDR
, IS
, ACTION-OF
all can do different things on different classes of values), but I guess that would be too much.
Can those setters and getters be optional in case you don't want to support DEFER
, and how can a program be written to work in both cases? If you have TOOLS EXT available, you can use
[DEFINED] is [IF]
is forth-recognize
[ELSE]
[DEFINED] to [DEFINED] forth-recognizer and [IF]
to forth-recognizer
[ELSE]
set-forth-recognizer
[THEN]
[THEN]
Yes, this is ugly and shows that having different options is not a good idea.
For the reworked proposal, I will need to restructure the text so that the optional parts I would rather remove are marked as such, which makes the final rewrite easy.
ruv
Deferred words in API considered harmful
make it mandatory that
FORTH-RECOGNIZE
actually is a deferred word
As we have discussed, the main problem with a deferred word is that it can't be redefined by wrappers that have additional actions when setting or getting the value. In this respect, such a word in an API is as bad as an address-flavoured variable (like BASE
).
There is also a recent discussion in comp.lang.forth (link) under subjects "value-flavoured approach" and "value-flavoured structures".
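To make the point about wrappers concrete, here is a minimal sketch assuming a deferred word is used internally but only a getter and a setter are exposed. All names (demo-recognize, rec-none, recognizer-changes, get-demo-recognizer, set-demo-recognizer) are hypothetical and are not taken from any proposal text.

```forth
\ Hypothetical names throughout; this is an illustration, not proposal text.
defer demo-recognize                  \ stands in for the system recognizer
: rec-none ( addr u -- 0 ) 2drop 0 ;  \ a recognizer that recognizes nothing
' rec-none is demo-recognize

variable recognizer-changes  0 recognizer-changes !

: get-demo-recognizer ( -- xt ) action-of demo-recognize ;

: set-demo-recognizer ( xt -- )
  1 recognizer-changes +!             \ the additional action on setting
  is demo-recognize ;
```

A program that calls set-demo-recognizer triggers the additional action, and the setter can later be redefined to do more. A program that applies IS to demo-recognize directly bypasses it, which is the problem described above for a deferred word that is itself part of the API.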
Special data object on failure considered harmful
A question is what to return on failure (unsuccess): a special data object (xt of notfound
) or a common data object 0
(zero).
Below is a copy of my rationale from 2023, with some rewording.
There are two strong arguments against a special data object:
- consistency with other similar words;
- impact on the overall lexical size of programs.
Consistency
Many standard words return some data object on success, or 0 (zero) on failure. This is possible because this data object cannot be 0.
For example:
name>interpret ( nt -- xt | 0 )
find-name ( sd.name -- nt | 0 )
find-name-in ( sd.name wid -- nt | 0 )
find ( c-addr -- xt n | c-addr 0 )
search-wordlist ( sd.name wid -- xt n | 0 )
source-id ( -- fileid | -1 | 0 ) (not a failure, but also an example where zero was chosen instead of a special object)
Also, it is a common approach in practice. This allows common higher-order functions to operate on the common failure result 0.
Why should not recognizers follow this practice? Why should they return a special id on failure rather than zero?
Lexical code size
Returning notfound
on failure makes the code shorter (in terms of lexemes) in some places. But the point is that it makes code longer in more places.
I checked the source codes in Gforth (as of 2023-09-17), which include both the implementation and usage of a Recognizer API. In its code:
- ['] notfound with = or <> is used 10 times, and without checking 32 times.
- forth-recognize execute is used 3 times.
If we use 0
(zero) instead of the notfound
xt, then:
- ['] notfound <> is removed 5 times, which eliminates 15 lexemes;
- ['] notfound = is replaced with 0<> 5 times, which eliminates 10 lexemes;
- ['] notfound is replaced with 0 32 times, which eliminates 32 lexemes;
- the definition for notfound is removed, and a definition for ?found is added:
  : ?found ( x.some\0 -- x.some | 0 -- never ) dup 0= -13 and throw ;
  which adds not more than +3 lexemes;
- forth-recognize execute is replaced with forth-recognize ?found execute 3 times, which adds +3 lexemes;
- the word ?found can be also used after find, search-wordlist, find-name, find-name-in, when the user needs to execute their result at once and failure should produce an exception.
Thus, replacing notfound with zero reduces the overall lexical code size in Gforth by more than 51 lexemes, which is more than 0.4KiB in absolute size (as of 2023-09-17).
So why should we prefer an approach that increases the overall lexical size of programs?
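As an illustration of the last item in the list, here is a small sketch of ?found applied to the result of search-wordlist. The helper execute-named is a hypothetical name introduced only for this example.

```forth
: ?found ( x\0 -- x | 0 -- never ) dup 0= -13 and throw ;

\ Hypothetical helper: execute a word looked up by name in the
\ FORTH word list, or throw -13 ("undefined word") if it is missing.
: execute-named ( i*x sd.name -- j*x )
  forth-wordlist search-wordlist    \ S: xt 1 | xt -1 | 0
  ?found drop execute ;
```

For example, `5 s" DUP" execute-named` behaves like `5 dup`, while an unknown name raises exception -13, which a caller can intercept with CATCH. The same ?found works unchanged after find-name or find-name-in, because all of them report failure with the same 0.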
AntonErtl
About the proposal text
The "Problem" section does not describe a problem of Forth-2012 that the proposal wants to solve, but considers a problem with some other recognizer proposal. Similarly, the "Solution" section refers to some other recognizer proposal. This makes these sections useless for readers who have not first read up on the other proposal, which is not even linked here. Parts of the "Solution" section might be useful in another section on transitioning from the earlier proposal.
Instead, the "Problem" and "Solution" sections should describe what benefits this proposal adds to the standard, and how. A possible "Discussion" section and its subsections should describe the benefits of the present approach over possible alternative approaches (if that's too detailed, lazy system implementors will complain about the length of the proposal, but some complaints should just be ignored).
"Typical use" should of course be presented.
State-dependence
The proposal in its present form is unacceptable to me because it
defines a defining word TRANSLATE:
for state-dependent words, and
expects recognizers to produce the xt of state-dependent words. This
makes the translators hard to use anywhere except in INTERPRET
; the
proposed-for-standard interface is even hard (actually impossible with
standard means) to use in POSTPONE
, which is an intended user of
translators, as the proposal admits itself:
POSTPONE can do that without a standardized way
Another problem with the state-dependent translators is that it leads to either handwaving specifications of what they do, as evidenced in XY.3.1:
TRANSLATE-THING ( jx ix -- k*x )
A translator xt that interprets, compiles or postpones the action of the thing according to what the state the system is in.
in the non-specification of what translator-xt does in
FORTH-RECOGNIZE
and the handwaving specification of "name:" in
TRANSLATE
:
"name:" ( jx ix -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.
and the nonspecification of what TRANSLATE-NT
, TRANSLATE-NUM
,
TRANSLATE-DNUM
, TRANSLATE-FLOAT
, and TRANSLATE-STRING
do.
Or if you specify exactly what happens, it leads to lengthy texts that
explain the state-dependence, and the three different cases. And you
cannot even specify when xt-post is performed, because there is no
"postpone state" in the standard. On the contrary the current
document specifies that STATE
is either 0 (interpretation state) or
non-zero (compilation state), without any values left for a postpone
state, and specifies only words for getting into interpretation state
and compilation state, not postpone state.
If you really believe that the state-dependent approach is a good idea, please specify all these words exactly; the editor won't do it for you.
Opaque solution
If there is no need to make POSTPONE implementable in a standardized way, there is no need to make INTERPRET (which is not even standardized) implementable in a standardized way, either, and the translators can become a completely opaque thing that the standard does not document. In that case there is also no need for the translators to actually be executable. The recognizer could return an opaque translation token, and standard programs can only use that for implementing recognizers, but not for implementing text interpreters, POSTPONE, or anything else.
Transparent solutions
Alternatively, we might heed "Don't bury your tools!" and have a more useful interface for translators, like what we have seen in earlier drafts and other recognizer proposals.
POSTPONE
If the idea of the proposal is that xt-post is actually used by POSTPONE, the proposal should specify the change to POSTPONE.
Standardize recognizers
I expect that more people will want to compose existing recognizers into recognizer sequences than to define new recognizers, but they usually need to know about existing recognizers in order to do that. Therefore the proposal (or an accompanying proposal) should not just propose standard translators, but also standard recognizers.
ruv
@AntonErtl writes:
The proposal in its present form is unacceptable to me because it defines a defining word
TRANSLATE:
for state-dependent words, and expects recognizers to produce the xt of state-dependent words.
I do not like TRANSLATE:
either, but for a different reason. Sometimes it is very convenient to define a translator as a quotation (right inside the recognizer), and if you are forced to define a translator only with TRANSLATE:
, you cannot define it as a quotation.
This makes the translators hard to use anywhere except in
INTERPRET
;
Could you provide some examples, please? It seems, this is not harder than performing the observable interpretation semantics using the result of name>interpret
.
BerndPaysan
Concerning explicit access methods to xt-int/xt-comp/xt-post, I can offer the following compromise, as a result of observations made:
It turns out that you can not access xt-int and xt-comp by setting STATE
, executing the translator, and then reverting STATE
to the value before, because words can change STATE
as part of their interpretation or compilation semantics, and in that case, the state change is a desired result of performing interpretation or compilation semantics.
However, it turns out that you can access xt-post that way, because the only word that possibly changes that state is [[
, and that token is a) not visible at all to POSTPONE
, and b) changes the state back to compilation state, the state POSTPONE
was in anyhow.
So if your system allows full explicit access to all three possible states, all translators have to be defined by 'TRANSLATE:', and I can offer you three access methods. If you only want to implement POSTPONE
, the following definition actually works:
: postpone ( "string" -- )
parse-name forth-recognize ?found
state @ >r -2 state ! execute r> state ! ; immediate
Further observations:
Gforth has >INTERPRET
and >COMPILE
, and doesn't use them, only >POSTPONE
is used. In exactly one place, in POSTPONE
. All other invocations are through EXECUTE
or only taking the data. The rest is implementation, including the extension towards more of those access methods for more, user-defined states. The question is whether you need to standardize a tool that has no use case, even if you don't bury it.
A possible way to deal with this is to move this out to a separate proposal.
What has been quite useful is the EXECUTE
interface for user-written interpreters, because these are interpret-only, and don't need the complication of state-dependent translators at all.
BerndPaysan
Sleeping over it added a few ideas:
The invocation through changing STATE
and restoring it works (in general) for translators that will definitely not change STATE
as part of their own operation, e.g. translators for literals. It also works (as a special case) for POSTPONE
, so a standard implementation of POSTPONE
using that method is possible. The postpone mode itself, which needs to change STATE
at [[
relies on the dispatch through STATE
without setting and restoring STATE
around the invocation, so it also works.
The question here is not if that implementation is a quality implementation, but whether it's not so bad that it is another bag full of inconsistencies. IMHO, TRANSLATE-NT
will have demonstrable inconsistencies when not using the clean TRANSLATE:
interface, but combined literal translators won't. For the cleaner interface outside of POSTPONE
itself (which is special case enough to not require the cleaner interface), we have to demonstrate that there is an actual use case. So far, we don't have one.
Both POSTPONE
with the additional functionality and the postpone mode ]]
… [[
will become part of the proposal.
ruv
@BerndPaysan writes:
It turns out that you can not access xt-int and xt-comp by setting STATE, executing the translator, and then reverting STATE to the value before, because words can change STATE as part of their interpretation or compilation semantics, and in that case, the state change is a desired result of performing interpretation or compilation semantics.
This is wrong. Yes, the state change can be a desired result of interpretation or compilation semantics, but this does not prevent us from performing the interpretation or compilation semantics regardless of the initial value of STATE, as I have shown many times.
We can use the following helpers for that.
\ Useful factors
: compilation ( -- flag ) state @ 0<> ;
: enter-compilation ( comp: false -- true | comp: true -- true ) ] ;
: leave-compilation ( comp: true -- false | comp: false -- false ) postpone [ ;
\ For the execution semantics identified by xt,
\ perform the part that can be observed in interpreted state.
: execute-interpreting ( i*x xt -- j*x )
compilation 0= if execute exit then
leave-compilation execute enter-compilation
;
\ For the execution semantics identified by xt,
\ perform the part that can be observed in compilation state.
: execute-compiling ( i*x xt -- j*x )
compilation if execute exit then
enter-compilation execute leave-compilation
;
If we have a result of recognizing with the xt of a translator at the top (i.e., a fully qualified token), and we want to perform the corresponding interpretation semantics regardless of the current value of STATE
, we should execute this xt with execute-interpreting
. If we want to perform the corresponding compilation semantics regardless of the current value of STATE
, we should execute it with execute-compiling
. If we want to perform the semantics according to STATE
, we should just execute this xt with execute
.
The key point in the implementation of execute-interpreting
and execute-compiling
is that we do not save/restore STATE
if it matches the semantics we want to perform — and if changing STATE
is part of the semantics, STATE
will be changed. On the other hand, if STATE
does not match the semantics we want to perform, we change STATE
and then restore it — if changing STATE
is part of the semantics, then it will change STATE
to the same value that was saved, which is also the value we restore it to. Thus, the resulting STATE
will be as expected!
NB: execute-interpreting
and execute-compiling
are also required if we want to perform the interpretation semantics or compilation semantics from an nt, regardless of the current value of STATE
. Moreover, these words are required even in the old approach for Recognizer API, which provides the words RECTYPE>INT
and RECTYPE>COMP
— because these words have the same flaw for state-dependent words as NAME>INTERPRET
and NAME>COMPILE
.
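A short usage sketch for these helpers. The helper definitions are repeated so that the snippet stands alone; the word probe is hypothetical and only reports the state it is executed in.

```forth
: compilation ( -- flag ) state @ 0<> ;
: enter-compilation ( -- ) ] ;
: leave-compilation ( -- ) postpone [ ;

: execute-interpreting ( i*x xt -- j*x )
  compilation 0= if execute exit then
  leave-compilation execute enter-compilation ;

: execute-compiling ( i*x xt -- j*x )
  compilation if execute exit then
  enter-compilation execute leave-compilation ;

\ Hypothetical probe: returns true if executed in compilation state.
: probe ( -- flag ) compilation ;
```

Typed at the interpreter prompt, `' probe execute-compiling` yields true although the system is interpreting, and STATE is back to interpretation state afterwards; `' probe execute-interpreting` yields false. The same holds the other way round when the helpers are invoked from an immediate word during compilation.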
ruv
@AntonErtl writes about token translators:
if you specify exactly what happens, it leads to lengthy texts that explain the state-dependence, and the three different cases. And you cannot even specify when xt-post is performed, because there is no "postpone state" in the standard. On the contrary the current document specifies that STATE is either 0 (interpretation state) or non-zero (compilation state), without any values left for a postpone state, and specifies only words for getting into interpretation state and compilation state, not postpone state.
This is reasonable. And we also discussed in the Recognizer chat group that the standard does not imply such a state as postponing (for the Forth text interpreter).
In my opinion, these problems can be avoided.
We should specify "to translate a token" and "token translator" in the common sections of term definitions, data types and usage requirements. Then, we do not need to repeat that for every token translator. It will be enough to specify that a word is a token translator, and the data type of the token (that it translates).
We can have a word like
postpone-token ( qt -- )
that appends the compilation semantics of a lexeme, which was recognized as qt, to the current definition. (qt is a qualified token, which is a pair of an unqualified token and token translator ( uq tt ))
So, any additional state, if any, is encapsulated into postpone-token
. The standard should not specify it.
Thus, postpone
can be defined like this (in my parlance):
: postpone ( "name" -- )
parse-lexeme perceive ?found postpone-token
;
How postpone-token
finds/performs the postponing action from tt — it's an internal problem of implementation. The word postpone-token
should throw the exception -32 "invalid name argument" if a postponing action is not associated with tt.
We need to provide a way to associate a postponing action (an xt) with a tt, or to create a new tt from an xt and tt. The postponing action should be optional. The user needs to provide a postponing action only if they want to make postpone
applicable to the corresponding lexemes.
For example, we can have an optional word postponable ( tt1 xt.postponing -- tt2 )
. Probably, this word shall return the same tt2 for the same input pair ( tt1 xt.postponing )
.
This word is optional, because it can be implemented along with postpone-token
in a standard program, and postpone
can be redefined to use them.
ruv
@AntonErtl writes:
In that case there is also no need for the translators to actually be executable. The recognizer could return an opaque translation token,
I researched this approach.
In general, a recognizer returns a qualified token (qt) on success, where a qualified token is a pair of an unqualified token (ut) and a token descriptor (td).
Data type relations:
- unqualified token:
ut => ( S: i*x F: k*r )
- token descriptor:
td => x\0
- qualified token:
qt => ( ut td )
It is always possible to define a word translate-qtoken ( any qt -- any )
, which translates a qualified token (i.e., performs the interpretation or compilation semantics for the corresponding recognized lexeme depending on STATE). And as practice shows, it is very useful and in demand.
Additionally, in Forth, it is always technically possible to make the token descriptor also a token translator (that is a subtype of the execution token), without any loss (see an example).
- token translator:
tt => xt ; td = tt
So, instead of using a separate word translate-qtoken
, we can use the word execute
. And the Forth text interpreter simply executes the token translator (instead of applying translate-qtoken
to qt). Note that regardless of whether the token descriptor is a subtype of the execution token, the token descriptor is opaque for the Forth text interpreter. The only difference is whether translate-qtoken
or execute
is used by the Forth text interpreter.
The big advantage of token translators is that they can be defined inline as quotations, and they can be used to define other token translators. This simplifies programs and reduces the lexical size of programs.
Also, token translators allow us to define dual-semantics words more simply.
For example, this is a definition for [']
, which has the expected interpretation semantics:
: ['] ( -- xt | ) ' tt-xt ; immediate
See also in my gist the word missing(
, which has the expected interpretation and compilation semantics. Without token translators such words are more difficult to implement.
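For readers without the gist at hand, the following sketch shows how such a definition can work. The behaviour given to tt-xt here is an assumption made for illustration (leave the xt when interpreting, compile it as a literal when compiling), not a quotation from the reference implementation, and the word is named my['] (hypothetical) to avoid redefining the standard word.

```forth
\ Assumed behaviour of tt-xt, for illustration only:
\ leave the xt when interpreting, compile it as a literal when compiling.
: tt-xt ( xt -- xt | ) state @ if postpone literal then ;

\ Same shape as the definition of ['] above, under a hypothetical name.
: my['] ( "name" -- xt | ) ' tt-xt ; immediate
```

With these definitions, `my['] dup` at the interpreter prompt leaves the xt of dup on the stack, and inside a colon definition it compiles that xt as a literal, so a single definition covers both the interpretation and the compilation semantics.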
AntonErtl
Concerning the supposed lack of use cases: I have mentioned use cases where the state
-based interface is at the very least cumbersome in r1038.
state
is a bad idea, as demonstrated by the problems mentioned above. We are stuck with it for the existing system, but we must not put state
in new interfaces, much less in new defining words.
As for opaque vs. transparent: Opaque would only be an option if the only use of translators was really in the text interpreter and in postpone
. But if we want to support other use cases (and there are other use cases, as discussed above), we should do a transparent user interface. And it must not be state
-dependent.
ruv
Use cases
Concerning the supposed lack of use cases: I have mentioned use cases where the state-based interface is at the very least cumbersome in r1038.
Do you mean this example: "you cannot implement postpone
or ]]...[[
as standard-compliant code"?
Then it's unclear what specification for these words you cannot implement? Because:
- You provided a portable (standard compliant) implementation for ]] ... [[ (based on postpone). This implementation does not depend on how postpone is implemented.
- A standard postpone can be implemented using find or find-name. An advanced postpone can be also implemented in a standard-compliant way.
Could you please clarify?
Probably you mean that the user should be able to create a recognizer and assign it to the perceptor, and then postpone
(and ]] ... [[
that uses this postpone
) should be applicable to lexemes that this recognizer recognizes. But I do not see any connection to the state-based interface either.
state
is a bad idea, as demonstrated by the problems mentioned above.
I implemented postpone
in four different approaches (see fep-recognizer/implementation/variant.gamma/postpone/index.fth) in my "gamma" reference implementation for Recognizer API.
This reference implementation is portable and can be loaded in Gforth as
gforth implementation/index.fth
In every approach I defined the interpretation semantics for postpone
, so postpone
depends on state
. In every approach the words compile-postpone-qtoken ( qt -- )
and translate-postpone-qtoken ( any qt -- any )
are provided. The former does not depend on state
, the latter does depend on state
.
In the variant postpone/auto.via-mmode.fth the macro-compilation mode is employed (one more state, if you like). This is the variant that is loaded by default in the current version (Commit f3b7d01). The macro-compilation mode is very useful because it also allows implementing a more useful and advanced variant than your construct ]] ... [[
.
Could you please demonstrate a problem concerning state-dependency in any of these approaches?
As for opaque vs. transparent: Opaque would only be an option if the only use of translators was really in the text interpreter and in postpone. But if we want to support other use cases (and there are other use cases, as discussed above), we should do a transparent user interface.
Could you please provide a practical example when you need a transparent token descriptor structure?
BerndPaysan
Using EXECUTE
instead of a special translator-specific word allows to use the rest of the recognizer API for interpreters that don't have any state at all. This actually happens and is useful; e.g. the parser in net2o's chat system uses that. There's absolutely no need for any other mode than directly interpreting. And using EXECUTE
does not mean you have to set STATE
if you call a translator for a particular state (interpreting/compiling/postponing) directly, though confusing results are likely if you do so and the word executed is a state-smart word. The level of surprise is likely small, because so far the only direct access method that is actually useful is the one for the postpone action, and that never executes the word found.
I don't want to mandate a particular implementation. Choose the implementation you like. I'll add an API that allows direct and default invocation of a translator. I'm not sure if I want this in the same proposal or split it into another one, so we can vote on those separately.
ruv
@AntonErtl wrote in r1038
when you write a state-independent text interpreter, such as a polyForth-style text interpreter, or colorforth-bw, you would have to set
state
before executing the translators, which is perverse.
It is not more perverse than repeating «_
», «]
», or «[
» before each lexeme in a program ;)
In general, when the Recognizer word set is provided, the Forth text interpreter itself knows nothing about STATE
. If you write a state-independent text interpreter, your recognizers should provide state-independent token translators. And you do not have to set state
at all. I rewrote your colorforth-bw example in Recognizer API. It just works. Note how translators are embedded into recognizers using quotations (in Commit 6c72064).
And in the case of colorforth-bw, again there is no standard way to set
state
to get the translator to perform xt-post.
In this approach, why do you need to write «[postpone _foo
» instead of «]foo
» ?
AntonErtl
@BerndPaysan:
If you eliminate the state-dependence of translators, then text
interpreters that use more than just the xt-int action (e.g., the one
for colorforth-bw, see below) can be written without having to deal with state
.
And text interpreters that use xt-post can be written using the
proposed wordset rather than having to use a detour through postpone
(which is a parsing word, possibly introducing additional
complications).
The following is also relevant to @ruv:
Ruv's colorforth-bw
implementation
demonstrates the shortcomings of the present proposal, because it does
not use recognizers nor translators at all for implementing
recognize-colorforth-bw
; instead, it reimplements everything that the name
recognizer and the number recognizer already do internally, nicely
demonstrating that the present proposal buries the tools. And it only
implements dealing with names and single-cell numbers. Finally, the
implementation is so long (44 lines without putting it into
forth-recognize
) that you have not shown it inline, but posted a
link to github.
By contrast, let's take much of the proposal from [r1081], but replace the state-dependent translators with the state-independent rectypes of [160]. With such a proposal, colorforth-bw might look as follows (untested):
defer recognizer1 forth-recognizer is recognizer1
: prefix>index ( c -- n )
case
'[' of 0 endof
'_' of -1 endof
']' of -2 endof
1 swap
endcase ;
: rectype-colorforth-bw ( ... rectype index state -- ... )
drop \ we use index, not the surrounding Forth interpreter's state
swap execute ;
: recognize-colorforth-bw ( c-addr u -- )
dup 0= if 2drop ['] notfound exit then
over c@ prefix>index dup 0 > if 2drop drop ['] notfound exit then
>r 1 /string recognizer1 r> ['] rectype-colorforth-bw ;
' recognize-colorforth-bw set-forth-recognize
This has only 20 lines (vs. 44), and it uses all the recognizers
originally present in forth-recognizer
(name, integers (including
doubles), FP, etc.). This demonstrates the superior expressive power
of the rectypes from [160] over the translators from [r1081].
BTW, I find the presence of both forth-recognize
and
forth-recognizer
confusing, and would prefer to define
forth-recognize
as deferred word. If you have to have getters and
setters, call the getter get-forth-recognize
.
In this approach, why do you need to write «[postpone _foo» instead of «]foo» ?
Nobody is suggesting that. But you need to perform xt-post in order
to implement ]foo
. In your implementation, you do it by
reimplementing xt-post for the two recognizers you implement
internally to recognize-colorforth-bw
. If you would use a detour
through postpone
instead, you would use the xt-post invoked in that
way. And in my implementation above, xt-post is invoked directly.
ruv
@AntonErtl writes:
Ruv's colorforth-bw implementation demonstrates the shortcomings of the present proposal, because it does not use recognizers nor translators at all for implementing
recognize-colorforth-bw
; instead, it reimplements everything that the name recognizer and the number recognizer already do internally,
It's wrong. Have a look at L18-L19:
\ Reuse a recognizer for numbers
['] recognize-number-n-prefixed apply-recognizer-cf dup 0= if exit then
It uses the recognizer for numbers. And it uses find-name
instead of the recognizer for names (Forth words) just because it's simpler in this case. It does not reuse token translators.
And it only implements dealing with names and single-cell numbers.
Because your original example implemented only that. And I just rewrote your original example.
Finally, the implementation is so long (44 lines without putting it into forth-recognize) that you have not shown it inline, but posted a link to github.
Why count 10 lines of comments at the beginning of the file? Without comments, 31 lines, the same as in your example (lexical size is greater due to nt vs xt, and improvements in the behavior).
By contrast, let's take much of the proposal from [r1081], but replace the state-dependent translators with the state-independent rectypes of [160]. With such a proposal, colorforth-bw might look as follows (untested):
[...]
This has only 20 lines (vs. 44), and it uses all the recognizers originally present in forth-recognizer (name, integers (including doubles), FP, etc.). This demonstrates the superior expressive power of the rectypes from [160] over the translators from [r1081].
(I corrected the [r1081] link in the citation above)
This comparison is incorrect. Below is an implementation against the latest API version (except compile-postpone-qtoken
that is a variation of discussed postpone-qtoken
, which should be either present or implementable in any variant of API):
: cf-prefix>tt? ( c -- tt true | c false )
case
'[' of ['] execute-interpreting endof
'_' of ['] execute-compiling endof
']' of ['] compile-postpone-qtoken endof
0 exit
endcase true
;
defer recognize-default perceptor is recognize-default
: recognize-colorforth-bw ( sd.lexeme -- qt|0 )
dup 0= if nip exit then
over c@ cf-prefix>tt? 0= if drop 2drop 0 exit then
>r 1 /string recognize-default dup if r> exit then rdrop
;
16 lines.
Can be tested in Gforth too:
gforth index.fth example/recognize-colorforth-bw.fth
:noname cf( _1. _drop _s" foo" ) ; execute s" foo" compare 0= .s \ prints "1 -1"
AntonErtl
The latest proposal is [r1081] and it does not contain execute-interpreting, execute-compiling, compile-postpone-qtoken, or perceptor. And that's what we were tasked with discussing and giving feedback on. And that's what I did.
ruv
The latest proposal is [r1081] and it does not contain execute-interpreting, execute-compiling, compile-postpone-qtoken, or perceptor. And that's what we were tasked with discussing and giving feedback on. And that's what I did.
I see, thank you. Actually, [r1081] is outdated; a new version will be prepared soon and then it should be discussed (as was noted in the recognizer chat). Nevertheless, my example implementation of recognize-colorforth-bw above is compatible with [r1081] with the following exceptions: it relies on 0 instead of NOTFOUND (you should note how this makes things simpler), and it uses the method compile-postpone-qtoken, which appends the compilation semantics of a qualified token to the current definition (this method is missing in [r1081]). The word perceptor is simply a better name than forth-recognizer in [r1081] (I just posted a rationale from the chat in ForthHub/fep-recognizer).
The words execute-interpreting and execute-compiling are general words that are needed anyway to perform interpretation or compilation semantics regardless of the initial STATE; they can be implemented in standard Forth as:
: compilation ( comp: true ; S: -- true ; | comp: false ; S: -- false ; ) state @ 0<> ;
: enter-compilation ( comp: false -- true ; S: -- ; | comp: true ; S: -- ; ) ] ;
: leave-compilation ( comp: true -- false ; S: -- ; | comp: false ; S: -- ; ) postpone [ ;
: execute-interpreting ( i*x xt -- j*x )
compilation 0= if execute exit then
leave-compilation execute enter-compilation
;
: execute-compiling ( i*x xt -- j*x )
compilation if execute exit then
enter-compilation execute leave-compilation
;
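As a usage sketch only (the helper names `seen`, `probe`, `probe-now` and `dummy` are hypothetical and not part of any proposal), the following repeats the definitions above with simplified stack comments and checks which state an xt observes when run through each word. It assumes a system where `]` and `[` may be executed from inside a colon definition, as in Gforth.

```forth
: compilation ( -- flag ) state @ 0<> ;
: enter-compilation ( -- ) ] ;
: leave-compilation ( -- ) postpone [ ;
: execute-interpreting ( i*x xt -- j*x )
  compilation 0= if execute exit then
  leave-compilation execute enter-compilation ;
: execute-compiling ( i*x xt -- j*x )
  compilation if execute exit then
  enter-compilation execute leave-compilation ;

variable seen                            \ hypothetical: records what PROBE saw
: probe ( -- ) compilation seen ! ;

\ Called while interpreting: PROBE runs in compilation state.
' probe execute-compiling  seen @ .      \ prints -1

\ Called while compiling DUMMY: PROBE runs in interpretation state.
: probe-now ( -- ) ['] probe execute-interpreting ; immediate
: dummy ( -- ) probe-now ;
seen @ .                                 \ prints 0
```

In both cases the state that was active before the call is restored afterwards, which is the point of wrapping the state change around `execute`.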
ruv
@AntonErtl writes:
If you eliminate the state-dependence of translators, then text interpreters that use more than just the xt-int action (e.g., the one for colorforth-bw, see below) can be written without having to deal with state.
Token translators cannot be written without having to deal with state (possibly indirectly), by definition of the term. A token translator shall perform different actions depending on the state, and it does not matter how the state is passed to the translator: through the data stack, through a separate stack intended for this purpose, or through an internal variable. The state does not matter in only one case: if the translator shall perform the same action regardless of the state.
Moreover, if you pass a parameter that encodes compilation state or interpretation state by some means other than STATE, you have to keep STATE in sync with this parameter to guarantee that STATE-dependent words are translated correctly.
BerndPaysanNew Version: minimalistic core API for recognizers
Minimalistic Recognizer API
Author:
Bernd Paysan
Change Log:
- 2020-09-06 initial version
- 2020-09-08 taking ruv's approach and vocabulary at translators
- 2020-09-08 replace the remaining rectypes with translators
- 2022-09-08 add the requested extensions, integrate results of bikeshedding discussion
- 2022-09-08 adjust reference implementation to results of last bikeshedding discussion
- 2022-09-09 Take comments from ruv into account, remove specifying STATE involvement
- 2022-09-10 More complete reference implementation
- 2022-09-10 Add use of extended words in reference implementation
- 2022-09-10 Typo fixed
- 2022-09-12 Fix for search order reference implementation
- 2022-09-15 Revert to Trute's table approach to call specific modes deliberately
- 2023-08-08 Remove names for table access words; there's no usage outside POSTPONE seen; POSTPONE can do that without a standardized way.
- 2023-09-11 Remove the role of system components for TRANSLATE-NT and TRANSLATE-NUM
- 2023-09-13 Make clear that TRANSLATE: is the only way to define a standard-conforming translator.
- 2023-09-15 Add list of example recognizers and their names.
- 2024-12-15 Take comments after freezing the proposal into account
Problem:
Problem
The current recognizer proposal has received a fair amount of criticism. One point is that its API is too big. So this proposal tries to create a very minimalistic API for a core recognizer, and allows more fancy features to be implemented as extensions. The problem this proposal tries to solve is the same as with the original recognizer proposal; this proposal is therefore not a full proposal, but sketches some changes to the original proposal.
The Forth compiler can be extended easily. The Forth interpreter however has a fixed set of capabilities as outlined in section 3.4 of the standard text: Words from the dictionary and some number formats.
Solution:
It's not possible to use the Forth text interpreter in an application or system extension context. Most interpreters in existing systems use a number of hooks to extend the interpreter. That makes it possible to use a loadable library to implement new data types to be handled like the built-in ones. One example is floating point numbers: they have their own parsing and data handling words, including a stack of their own.
Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.
Furthermore, applications need to use system-provided and system-specific words or have to re-invent the wheel to get numbers with a sign or hex numbers with the $ prefix. The building blocks (FIND, COMPILE,, >NUMBER etc.) are available, but there is a gap between them and what the Forth interpreter already does.
Important changes to the original proposal:
The Forth interpreter is stateful, but the API should avoid the problems of the STATE variable. In particular, an implementation without STATE should be possible, and there is only one place where the stateful dispatch is necessary.
- Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating a special implementation
Solution
This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.
The monolithic design of the Forth interpreter is factored into three major blocks:
The core principle is still that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
- The interpreter. It extracts sub-strings (lexemes) from SOURCE, hands them over to the data parsing and processes the results.
: rec-xt ( addr u -- translator )
The actual data parsing. It analyses lexemes whether they match the criteria for a certain token type. These words, called recognizers, can be grouped to achieve an order of invocation.
The result of the recognizer, a translator and associated data, is handed over to the interpreter.
There is no strict 1:1 relation between a recognizer and the returned translator. A translator for e.g. single cell numbers can be used by different recognizers, a recognizer can return different translators (e.g. single and double cell numbers).
Whenever the Forth text interpreter is mentioned, the standard words EVALUATE (CORE), ' (tick, CORE), INCLUDE-FILE (FILE), INCLUDED (FILE), LOAD (BLOCK) and THRU (BLOCK) are expected to act likewise. This proposal does not aim to change these words, but to provide the tools to do so. As long as the standard feature set is used, a complete replacement with recognizers is possible.
Important changes to the Matthias Trute proposal:
- Make the translators executable to dispatch according to the state (interpreting, compiling, postponing) themselves
- Use dedicated invocation methods to call a translator for a particular state
- Make the recognizer sequence executable with the same effect as a recognizer
- Make sure the API is not mandating any particular implementation
The core principle is that the recognizer is not aware of state, and the returned translator is. If you have for some reason legacy code that looks like
: recognize-xt ( addr u -- translator-stub | 0 )
here place here find dup IF
0< state @ and IF compile, ELSE execute THEN ['] noop
THEN ;
then you should factor the part starting with STATE @ out and return it as a translator:
: translate-xt ( xt flag -- )
0< state @ and IF compile, ELSE execute THEN ;
: recognize-xt ( addr u -- ... translator | 0 )
here place here find dup IF ['] translate-xt THEN ;
In a second step, you need to remove the STATE @ entirely and use TRANSLATE:, because otherwise POSTPONE won't work. If you don't know what to do on postpone at this stage, use -48 throw; otherwise define a postpone action:
:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF compile, ELSE execute THEN ;
:noname ( xt flag -- ) 0< IF postpone literal postpone compile, ELSE compile, THEN ;
translate: translate-xt
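To see the three actions of translate-xt at work, here is a minimal sketch. It assumes the table-based TRANSLATE: from the reference implementation and a system where STATE holds 0 while interpreting and -1 while compiling (as in Gforth); the helper names `compile-dup` and `twice` are hypothetical and only serve the demonstration.

```forth
\ Assumption: STATE is 0 (interpreting), -1 (compiling) or -2 (postponing).
: translate: ( xt-int xt-comp xt-post "name" -- )
  create , , ,                        \ cells hold xt-post, xt-comp, xt-int
  does> state @ 2 + cells + @ execute ;

:noname ( xt flag -- ) drop execute ;
:noname ( xt flag -- ) 0< IF compile, ELSE execute THEN ;
:noname ( xt flag -- ) 0< IF postpone literal postpone compile, ELSE compile, THEN ;
translate: translate-xt

\ Interpreting: the xt of DUP is executed right away.
3 ' dup -1 translate-xt + .           \ prints 6

\ Compiling: the immediate helper runs translate-xt while TWICE is being
\ compiled, so DUP is appended to TWICE instead of being executed.
: compile-dup ( -- ) ['] dup -1 translate-xt ; immediate
: twice ( n -- n n ) compile-dup ;
7 twice + .                           \ prints 14
```

The flag follows the FIND convention used in the text: -1 for a normal word, 1 for an immediate word, which is why the compile action tests it with 0<.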
Typical use
The standard interpreter loop should look like this:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize ?found execute REPEAT
2drop ;
with the usual additions to check e.g. for empty stacks and such.
Typical use
Operating a recognizer in a particular state, e.g. to postpone a single word, do
: postpone ( "name" -- )
parse-name forth-recognize ?found postponing ; immediate
To obtain an xt for a name, use something like this:
: ' ( "name" -- xt )
parse-name forth-recognize ?found
['] translate-nt <> #-32 and throw
name>interpret ;
Proposal:
XY. The optional Recognizer Wordset
A recognizer takes the string of a lexeme and returns a translator xt and additional data on the stack (no additional data for NOTFOUND
):
XY.1 Introduction
REC-SOMETYPE ( addr len -- i*x translate-xt | NOTFOUND )
Recognizers have the form
REC-SOMETYPE ( addr len -- i*x j*r translate-xt | 0/NOTFOUND )
A recognizer takes the string addr len of a lexeme and on success returns a translator translate-xt and additional data on the data and floating point stack.
[IF] NOTFOUND=0
If it fails, it returns 0.
[ELSE] NOTFOUND=xt
If it fails, it returns the xt of NOTFOUND.
For clarity, unless this issue is decided, the non-success return value of a recognizer is notated as 0/NOTFOUND. The reference implementation uses the option 0.
[THEN] notfound
[IF] side-effect
A recognizer shall not have a side effect.
Rationale: Side effects are supposed to all happen inside the translators. This promise allows trying to recognize something and failing if the result is not desired, without having to roll back unknown changes. Examples: the tick and TO recognizers pass a substring of the string to be translated to FORTH-RECOGNIZE, and fail if the result is not a name type.
[THEN] side-effect
XY.3 Additional usage requirements
XY.3.1 Translator
translator: named subtype of xt, and executes with the following stack effect:
TRANSLATE-THING ( j*x i*x -- k*x )
name ( j*x i*x -- k*x )
A translator xt that interprets, compiles or postpones the action of the thing according to the state the system is in.
i*x is the additional information provided by the recognizer, j*x and k*x are the stack inputs and outputs of interpreting/compiling or postponing the recognized lexeme.
XY.6 Glossary
XY.6.1 Recognizer Words
FORTH-RECOGNIZE ( addr len -- i*x translator-xt | 0/NOTFOUND ) RECOGNIZER
Takes a string and tries to recognize it, returning the translator xt and additional information if successful, or 0/NOTFOUND if not.
NOTFOUND ( -- ) RECOGNIZER
[IF] defer
Performs -13 THROW
. If the exception word set is not present, the system shall use a best effort approach to display an adequate error message.
FORTH-RECOGNIZE is a deferred word. Changing the system recognizer can be done with IS FORTH-RECOGNIZE, obtaining the system recognizer with ACTION-OF FORTH-RECOGNIZE.
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER EXT
Rationale: use existing API to change it; most simple systems have this available, and advanced systems have capabilities to work around limitations.
Create a translator word under the name "name". This word is the only standard way to define a translator.
[ELSE] setter and getter
"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current mode.
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE
to keep the API small. There may be other, non-standard modes of operation, where the individual component xts are accessed STATE
-independently, which only works on translators created by TRANSLATE:
(e.g. for implementing POSTPONE
), so any other way to define a translator is non-standard.
Assign the recognizer xt to FORTH-RECOGNIZE.
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
Rationale: systems that are not sufficiently advanced can work around the limitations of IS and ACTION-OF better with this API.
[THEN]
TRANSLATE: ( xt-int xt-comp xt-post "name" -- ) RECOGNIZER
Create a translator word under the name "name". This word is the only standard way to define a general purpose translator.
"name:" ( j*x i*x -- k*x ) performs xt-int in interpretation, xt-comp in compilation and xt-post in postpone state using a system-specific way to determine the current state.
Rationale: The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. You can not simply set STATE, use EXECUTE and afterwards restore STATE to perform interpretation or compilation semantics, because words can change STATE, so you need the words INTERPRETING and COMPILING defined below. This problem does not apply to POSTPONING, so systems that only want to implement direct access to POSTPONE mode can get away without TRANSLATE:.
[IF] NOTFOUND=0
?FOUND ( translator-xt -- translator-xt | 0 -- never ) RECOGNIZER
Check if the recognizer was successful, and if not, perform a -13 THROW or display an appropriate error message if the exception wordset is not present.
[THEN] NOTFOUND=0
XY.6.2 Recognizer Extension Words
SET-FORTH-RECOGNIZE ( xt -- ) RECOGNIZER EXT
[IF] NOTFOUND=0
Assign the recognizer xt to FORTH-RECOGNIZE.
?NOTFOUND ( translator-xt -- translator-xt | 0 -- addr u notfound-xt )
Rationale:
Check if the recognizer was successful. If not, replace the 0 result with the addr u of the last scanned lexeme, and put the xt of the NOTFOUND translator on top of the stack.
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise can use this word to change the behavior instead of using IS FORTH-RECOGNIZE
.
NOTFOUND ( -- never ) RECOGNIZER
FORTH-RECOGNIZER ( -- xt ) RECOGNIZER EXT
Translator for unsuccessful recognizers: perform a -13 THROW.
Obtain the recognizer xt that is assigned to FORTH-RECOGNIZE.
[THEN] NOTFOUND=0
Rationale:
POSTPONE ( "<spaces>lexeme" -- ) RECOGNIZER
FORTH-RECOGNIZE is likely a deferred word, but systems that implement it otherwise, can use this word to change the behavior instead of using ACTION-OF FORTH-RECOGNIZE
. The old API has this function under the name FORTH-RECOGNIZER (as a value) and this name is reused. Systems that want to continue to support the old API can support TO FORTH-RECOGNIZER
, too.
Compilation: recognize lexeme. On success, perform the postpone action of the returned translator, otherwise -13 THROW or display the appropriate error message if the exception wordset is not present.
RECOGNIZER-SEQUENCE: ( xt1 .. xtn n "name" -- ) RECOGNIZER EXT
Create a named recognizer sequence under the name "name", which, when executed, tries to recognize strings starting with xtn and proceeding towards xt1 until successful.
SET-RECOGNIZER-SEQUENCE ( xt1 .. xtn n xt-seq -- ) RECOGNIZER EXT
Set the recognizer sequence of xt-seq to xt1 .. xtn.
GET-RECOGNIZER-SEQUENCE ( xt-seq -- xt1 .. xtn n ) RECOGNIZER EXT
Obtain the recognizer sequence from xt-seq as xt1 .. xtn n.
TRANSLATE-NT ( j*x nt -- k*x ) RECOGNIZER EXT
Translates a name token:
Interpretation: perform the interpretation semantics of the word
Compilation: perform the compilation semantics of the word
Postpone: append the compilation semantics above to the current definition
REC-NT ( addr u -- nt translate-nt | 0/NOTFOUND ) RECOGNIZER EXT
Search the dictionary for the string addr u. If successful, return the nt and the xt of TRANSLATE-NT. If the search fails, return 0/NOTFOUND.
TRANSLATE-NUM ( x -- x | ) RECOGNIZER EXT
Translates a number:
Interpretation: keep the number on the stack
Compilation: Append the run-time defined in LITERAL to the current definition
Postpone: Append the compilation semantics above to the current definition
TRANSLATE-DNUM ( x1 x2 -- x1 x2 | ) RECOGNIZER EXT
Translates a double number:
Interpretation: keep the numbers on the stack
Compilation: Append the run-time defined in 2LITERAL to the current definition
Postpone: Append the compilation semantics above to the current definition
REC-NUM ( addr u -- x translate-num | xd translate-dnum | 0/NOTFOUND ) RECOGNIZER EXT
Convert addr u to a number x and the xt of TRANSLATE-NUM as specified in 3.4.1.3 or a double number xd and the xt of TRANSLATE-DNUM as specified in 8.3.1 if the double number wordset is available. If the conversion fails, return 0/NOTFOUND.
TRANSLATE-FLOAT ( r -- r | ) RECOGNIZER EXT
Translates a floating point number:
Interpretation: Keep r on the stack
Compilation: Append the run-time defined in FLITERAL to the current definition
Postpone: Append the compilation semantics above to the current definition
REC-FLOAT ( addr u -- r translate-float | 0/NOTFOUND ) RECOGNIZER EXT
Convert addr u to a number r as specified in 12.3.7 if the float wordset is available; if the conversion fails, return 0/NOTFOUND.
SCAN-TRANSLATE-STRING ( addr1 u1 string-rest<"> -- addr2 u2 | ) RECOGNIZER EXT
Complete parsing a string: addr1 u1 consists of the starting quote and additional characters up to the first space in the string. addr2 u2 consists of the entire string without the starting quote, up to (but not including) the final quote, with the escape sequences translated according to the rules of S\". >IN is modified appropriately and points just after the final quote. If there's no final quote in the current line, REFILL can be used to read in more lines, adding corresponding newlines to the string. The final quote can be inside addr1 u1; in that case >IN is set backwards.
Translate the string:
Interpretation: keep the string on the stack
Compilation: Append the run-time defined in SLITERAL to the current definition
Postpone: Append the compilation semantics stated above to the current definition
TRANSLATE-STRING ( addr1 u1 -- addr1 u1 | ) RECOGNIZER EXT
Translate the string:
Interpretation: keep the string on the stack
Compilation: Append the run-time defined in SLITERAL to the current definition
Postpone: Append the compilation semantics stated above to the current definition
?SCAN-STRING ( addr1 u1 scan-translate-string string-rest<"> -- addr2 u2 translate-string | ... translator -- ... translator ) RECOGNIZER
If the recognized token is an incomplete string, complete the scanning as defined for SCAN-TRANSLATE-STRING and replace the translator with the xt of TRANSLATE-STRING.
REC-STRING ( addr u -- addr u translate-string | 0/NOTFOUND ) RECOGNIZER EXT
Check if addr u starts with a quote, and return that string and the xt of SCAN-TRANSLATE-STRING if it does, 0/NOTFOUND otherwise.
[IF] Optional API for direct access of translator states
INTERPRETING ( j*x xt -- k*x ) RECOGNIZER EXT
Execute xt-int of the translator xt. If xt is not a translator, do -21 THROW, or a best-effort attempt to execute xt in interpreting state.
COMPILING ( j*x xt -- ) RECOGNIZER EXT
Execute xt-comp of the translator xt. If xt is not a translator, do -21 THROW, or a best-effort attempt to execute xt in compiling state.
POSTPONING ( j*x xt -- ) RECOGNIZER EXT
Execute xt-post of the translator xt. If xt is not a translator, do -21 THROW, or a best-effort attempt to execute xt in postponing state.
GET-STATE ( -- xt ) RECOGNIZER EXT
Obtain the operation xt performed when translating.
SET-STATE ( xt -- ) RECOGNIZER EXT
Makes xt the operation performed when translating. If xt is not related to ' INTERPRETING, ' COMPILING, or ' POSTPONING, do -12 THROW.
[THEN] optional API for direct access of translator states
]] ( -- ) RECOGNIZER EXT
Interpretation semantics: undefined
Compilation semantics: Set the system into postpone state. The interpreter will then perform xt-post of all translators found. Compilation state resumes when [[ is recognized. This word may change STATE and the recognizer sequence to reflect the change of this state.
[[ ( -- ) RECOGNIZER EXT
Interpretation semantics: undefined
Compilation semantics: undefined
Postpone semantics: enter compilation state, see ]; all changes to STATE and the recognizer sequence done by ]] are reverted.
Note: [[ needs special treatment in postpone mode, so it might also use a non-standard translator and not be a word at all.
STATE ( -- addr ) RECOGNIZER
If ]] uses STATE to store the postpone state, this extends the semantics of 6.1.2250 by adding a second non-zero value. ]] enters this state, and [[ leaves it. Only translators and the code responsible for displaying the prompt can see this third state, as all other words are postponed in this state.
Reference implementation:
This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix. This implementation does only take interpret and compile state into account, and uses the STATE variable to distinguish. It uses NOTFOUND=0.
Defer forth-recognize ( addr u -- i*x translator-xt / 0 )
: ?found ( translator -- translator | 0 -- never )
dup 0= IF -13 throw THEN ;
: interpret ( i*x -- j*x )
BEGIN
parse-name dup WHILE
forth-recognize ?found execute
REPEAT 2drop ;
: lit, ( n -- ) postpone literal ;
: notfound ( state -- ) -13 throw ;
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , ,
does> state @ 2 + cells + @ execute ;
An alternative implementation for TRANSLATE: can use a deferred word:
Defer do-translate
: translate: ( xt-interpret xt-compile xt-postpone "name" -- )
create , , , does> do-translate ;
: set-state ( xt -- ) dup is do-translate >body @ 2 - state ! ;
: get-state ( -- xt ) action-of do-translate ;
Extensions reference implementation:
: ]] -2 state ! ; immediate
: [[ -1 state ! ; immediate
:noname name>interpret execute ;
:noname name>compile execute ;
:noname dup name>interpret ['] [[ =
IF name>interpret execute \ special case
ELSE name>compile swap lit, compile, THEN ;
translate: translate-nt ( nt -- )
: lit, ( n -- ) postpone literal ;
' noop
' lit,
:noname lit, postpone lit, ;
translate: translate-num ( n -- )
: rec-nt ( addr u -- nt nt-translator | 0 )
forth-wordlist find-name-in dup IF ['] translate-nt THEN ;
: rec-num ( addr u -- n num-translator | 0 )
0. 2swap >number 0= IF 2drop ['] translate-num ELSE 2drop drop 0 THEN ;
: minimal-recognize ( addr u -- nt nt-translator | n num-translator | 0 )
2>r 2r@ rec-nt dup 0= IF drop 2r@ rec-num THEN 2rdrop ;
' minimal-recognize is forth-recognize
Extensions reference implementation:
: translate-method: ( n -- )
Create , DOES> @ cells + >body @ execute ;
0 translate-method: postponing
1 translate-method: compiling
2 translate-method: interpreting
: set-state ( xt -- )
>body @ 2 - state ! ;
: get-state ( -- xt )
case state @
0 of ['] interpreting endof
-1 of ['] compiling endof
-2 of ['] postponing endof
-11 throw
endcase ;
: postpone ( "name" -- )
parse-name forth-recognize ?found postponing ; immediate
This reference implementation uses a table dispatch only. Note that this can give surprising results when you directly apply a particular state, and one of the words executed (translator or nt/xt found) is a state-smart word. If you want to use combined translators, like
: translate-dnum ( d -- ) >r translate-num r> translate-num ;
you can't do it like this. Neither does this work if you execute state-smart words, as they expect STATE to be set accordingly. Instead, you'll use something like
: translate-method: ( n -- ) Create , DOES> @ dup state @ = IF drop execute EXIT THEN state @ >r state ! execute r> state ! ;
This will definitely work for combined literal translators, because those don't change state anyways.
This will also work for POSTPONE, because apart from the translator, no word is actually executed in one-shot POSTPONE, and therefore, no state change is possible.
This will also work for [ and ] (and words using them) while interpreting and compiling, because if you are already in the state from which the state is changed away, you will not restore the state. If you are in the state this will change to, this will work, too, because the state is restored after EXECUTE. This will not work if you are interpreting, and you do a s" ]]" forth-recognize ?found compiling, because that transitions to postponing, and then is reverted to interpreting.
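The save-and-restore behaviour described above can be tried in isolation. The sketch below is not part of the reference implementation: `execute-in-state`, `seen` and `probe` are hypothetical names, and writing to STATE directly is assumed to be tolerated by the system (it is in Gforth, but a standard program may not rely on it).

```forth
\ Run xt with STATE temporarily set to new-state, then restore the old value.
: execute-in-state ( i*x xt new-state -- j*x )
  dup state @ = IF drop execute EXIT THEN   \ already there: nothing to restore
  state @ >r  state !  execute  r> state ! ;

variable seen                               \ hypothetical: records the observed STATE
: probe ( -- ) state @ seen ! ;

' probe -1 execute-in-state                 \ PROBE runs with STATE = -1
seen @ .                                    \ prints -1
state @ .                                   \ prints 0
```

If the executed xt throws, this sketch leaves STATE at the temporary value; a more robust variant would wrap the EXECUTE in CATCH and restore STATE before rethrowing.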
[IF] setter and getter
: set-forth-recognize ( xt -- )
is forth-recognize ;
: forth-recognizer ( -- xt )
action-of forth-recognize ;
[THEN] setter and getter
Stack library
: STACK: ( size "name" -- )
CREATE 0 , CELLS ALLOT ;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
Recognizer sequences
: recognize ( addr len rec-seq-id -- i*x translator-xt | 0 )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP 0
;
#10 Constant min-sequence#
: recognizer-sequence: ( rec1 .. recn n "name" -- )
min-sequence# stack: min-sequence# 1+ cells negate here + set-stack
DOES> recognize ;
: ?defer@ ( xt1 -- xt2 )
BEGIN dup is-defer? WHILE defer@ REPEAT ;
: set-recognizer-sequence ( rec1 .. recn n rec-seq-xt -- )
?defer@ >body set-stack ;
: get-recognizer-sequence ( rec-seq-xt -- rec1 .. recn n )
?defer@ >body get-stack ;
Once you have recognizer sequences, define
' rec-num ' rec-nt 2 recognizer-sequence: default-recognize
' default-recognize is forth-recognize
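In this sequence rec-nt is the topmost xt, so it is tried first and rec-num only sees lexemes that are not names. The following stand-alone sketch illustrates that ordering rule with a fixed two-element sequence; `try-pair`, the toy recognizers and the stand-in translators are hypothetical names, and the sketch uses the NOTFOUND=0 convention.

```forth
\ Stand-in translators: they only mark which recognizer matched.
: translate-a ( -- ) ;
: translate-b ( -- ) ;

\ Toy recognizers; REC-A and REC-B both accept the lexeme "x".
: rec-a    ( addr u -- translator | 0 ) s" x" compare 0= IF ['] translate-a ELSE 0 THEN ;
: rec-b    ( addr u -- translator | 0 ) s" x" compare 0= IF ['] translate-b ELSE 0 THEN ;
: rec-none ( addr u -- 0 ) 2drop 0 ;

\ Try xt2 (topmost) first; fall back to xt1 with the saved lexeme.
: try-pair ( addr u xt1 xt2 -- i*x translator | 0 )
  2over 2>r  swap >r            \ R: addr u xt1    stack: addr u xt2
  execute dup IF                \ xt2 succeeded: discard the saved items
    r> drop 2r> 2drop EXIT
  THEN
  drop r> 2r> rot execute ;     \ stack: addr u xt1, then retry

s" x" ' rec-a ' rec-b    try-pair ' translate-b = .   \ prints -1
s" x" ' rec-a ' rec-none try-pair ' translate-a = .   \ prints -1
```

The lexeme is kept on the return stack because a successful recognizer may leave an unknown number of items under its translator, which is the same reason RECOGNIZE above saves addr len with 2>R.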
The recognizer stack looks surprisingly similar to the search order stack, and Gforth uses a recognizer stack to implement the search order. In order to do so, you define wordlists in a way that a wid is an execution token which searches the wordlist and returns the appropriate translator.
: find-name-in ( addr u wid -- nt / 0 )
execute dup IF drop THEN ;
root-wordlist forth-wordlist dup 3 recognizer-sequence: search-order
: find-name ( addr u -- nt / 0 )
['] search-order find-name-in ;
: get-order ( -- wid1 .. widn n )
['] search-order get-recognizer-sequence ;
: set-order ( wid1 .. widn n -- )
['] search-order set-recognizer-sequence ;
Recognizer examples
REC-NT ( addr u -- nt translate-nt | notfound ) Search the locals wordlist if locals have been defined, and then the search order for a definition matching the string addr u, and provide that name token as result.
Apart from the standardized recognizers above, here are some more examples of recognizers:
REC-NUM ( addr u -- n translate-num | d translate-dnum | notfound ) Try converting addr u into a number, and on success return either a single number n and translate-num, or a double number d and translate-dnum.
REC-TICK ( addr u -- xt translate-num | 0/NOTFOUND ) If addr u starts with a ` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.
REC-FLOAT ( addr u -- r translate-float | notfound ) Try converting addr u into a floating point number, and on success return that number r and translate-float.
REC-SCOPE ( addr u -- nt translate-nt | 0/NOTFOUND ) Search for words in specified vocabularies (the vocabulary needs to be found in the current search order). The string addr u has the form vocabulary:name; apart from specifying the vocabulary to be searched, REC-SCOPE is identical in effect to REC-NT.
REC-STRING ( addr u "string"<"> -- addrs us translate-string | notfound "string"<"> ) Convert quoted strings (i.e. addr u starts with '"') in the input stream into string literals, performing the same escape handling as S\" and on success return the converted string as addrs us and translate-string.
REC-TO ( addr u -- xt n translate-to | 0/NOTFOUND ) Handle the following syntax of TO
-like operations of value-like words:
REC-TICK ( addr u -- xt translate-num | notfound ) If addr u starts with a ````` (backtick), search the search order for the name specified by the rest of the string, and if found, return its xt and translate-num.
->
name asTO
name=>
name asIS
name+>
name as+TO
name'>
name asADDR
name@>
name asACTION-OF
name
REC-SCOPE ( addr u -- nt translate-nt | notfound ) Search for words in specified vocabularies (the vocabulary needs to be found in the current search order), the string addr u has the form vocabulary:
name, otherwise than that this specifies the vocabulary to be searched in, REC-SCOPE
is identical in effect to REC-NT
.
xt is the execution token of the value found, n indexes which variant of a TO
-like operation is meant, and translate-to is the corresponding translator.
REC-TO ( addr u -- xt n translate-to | notfound ) Handle the following syntax of TO
-like operations of value-like words:
REC-ENV ( addr u -- addr1 u1 translate-env | 0/NOTFOUND ) Takes a pattern in the form of ${
name}
and provides the name as addr1 u1 on the stack. The corresponding translator TRANSLATE-ENV
is responsible for looking up that name in the operating system's environment variable array, or compiling appropriate code to do so.
->
value asTO
value orIS
value+>
value as+TO
value'>
value asADDR
value@>
value asACTION-OF
value
REC-COMPLEX ( addr u -- rr ri translate-complex | 0/NOTFOUND ) Converts a pair of floating point numbers in the form of float1+
float2i
into a complex number on the stack, and returns the xt of TRANSLATE-COMPLEX
on success.
xt is the execution token of the value found, n indexes which variant of a TO
-like operation is meant, and translate-to is the corresponding translator.
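As an illustration of how small such a recognizer can be, here is a sketch in the spirit of REC-TICK written in standard Forth. It makes several simplifying assumptions: it searches only FORTH-WORDLIST instead of the whole search order, `demo-translate-num` is a hypothetical marker standing in for translate-num, and 0 is used as the "not recognized" result.

```forth
: demo-translate-num ( -- ) ;    \ hypothetical stand-in for translate-num

: demo-rec-tick ( addr u -- xt translator | 0 )
  dup 2 u< if 2drop 0 exit then             \ need ` plus at least one char
  over c@ [char] ` <> if 2drop 0 exit then  \ must start with a backtick
  1 /string                                 \ strip the backtick
  forth-wordlist search-wordlist            \ assumption: one wordlist only
  if ['] demo-translate-num else 0 then ;
```

The recognizer itself never looks at STATE; it only returns the token and a translator, and what happens with them is decided later.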
Testing
T{ 0 recognizer-sequence: RS -> }T
T{ :noname 1 ; :noname 2 ; :noname 3 ; translate: translate-1 -> }T
T{ :noname 10 ; :noname 20 ; :noname 30 ; translate: translate-2 -> }T
\ really stupid: 1 character length or 2 characters
T{ : rec-1 NIP 1 = IF ['] translate-1 ELSE 0 THEN ; -> }T
T{ : rec-2 NIP 2 = IF ['] translate-2 ELSE 0 THEN ; -> }T
T{ ' translate-1 interpreting -> 1 }T
T{ ' translate-1 compiling -> 2 }T
T{ ' translate-1 postponing -> 3 }T
\ set and get methods
T{ 0 ' RS set-recognizer-sequence -> }T
T{ ' RS get-recognizer-sequence -> 0 }T
T{ ' rec-1 1 ' RS set-recognizer-sequence -> }T
T{ ' RS get-recognizer-sequence -> ' rec-1 1 }T
T{ ' rec-1 ' rec-2 2 ' RS set-recognizer-sequence -> }T
T{ ' RS get-recognizer-sequence -> ' rec-1 ' rec-2 2 }T
\ testing RECOGNIZE
T{ 0 ' RS set-recognizer-sequence -> }T
T{ S" 1" RS -> 0 }T
T{ ' rec-1 1 ' RS set-recognizer-sequence -> }T
T{ S" 1" RS -> ' translate-1 }T
T{ S" 10" RS -> 0 }T
T{ ' rec-2 ' rec-1 2 ' RS set-recognizer-sequence -> }T
T{ S" 10" RS -> ' translate-2 }T
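To try the translator tests on a system without recognizer support, the words they exercise can be mocked in a few lines. The following is a hypothetical sketch, not the proposal's definition: it models a translator as a plain table of three xts, and all `demo-` names are invented for this illustration. A real translator would additionally be executable and dispatch by itself.

```forth
\ Store the interpret, compile and postpone actions in that order.
: demo-translate: ( xt-int xt-comp xt-post "name" -- )
  create rot , swap , , ;

\ Run one of the three actions of the translator whose xt is given.
: demo-interpreting ( i*x xt -- j*x ) >body @ execute ;
: demo-compiling    ( i*x xt -- j*x ) >body cell+ @ execute ;
: demo-postponing   ( i*x xt -- j*x ) >body 2 cells + @ execute ;

\ Usage, mirroring the first translator test above:
:noname 1 ; :noname 2 ; :noname 3 ; demo-translate: demo-t1
```

With these definitions, `' demo-t1 demo-interpreting`, `' demo-t1 demo-compiling` and `' demo-t1 demo-postponing` leave 1, 2 and 3 respectively, which is the behaviour the tests expect from `translate:`.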