Proposal: Recognizer committee proposal 2025-09-11
AntonErtl
[412] Recognizer committee proposal 2025-09-11 (Proposal, 2025-09-12 03:56:07)
The committee has found consensus on the following words. I was asked to write it up as a proposal, quickly. Due to time limits this is just a skeleton, and will not make sense to people new to the discussion. A more fleshed-out proposal will be submitted later.
Translations and text-interpretation
Recognizers produce translations. The text interpreter (and other
users, such as postpone) removes the translation from the stack(s)
and then performs either the interpreting run-time, the compiling
run-time, or the postponing run-time.
Unless otherwise specified, the compiling run-time compiles the interpreting run-time. The postponing run-time compiles the compiling run-time.
Types
translation: The result of a recognizer; the input of interpreting, compiling, and postponing; it's a semi-opaque type that consists of a translation token at the top of the data stack and additional data on various stacks below.
translation token: Single-cell item that identifies a certain translation. (This has formerly been called a rectype.)
Words
rec-name ( c-addr u -- translation )
If c-addr u is the name of a visible local or a visible named word,
translation represents the text-interpretation semantics
(interpreting, compiling, postponing) of that word (see
translate-name). (formerly called rec-nt). If not, translation is
translate-none.
rec-number ( c-addr u -- translation )
If c-addr u is a single or double number (without or with prefix), or
a character, all as described in section ..., translation represents
pushing that number at run-time (see translate-cell,
translate-dcell). If not, translation is translate-none.
rec-float ( c-addr u -- translation )
If c-addr u is a floating-point number, as described in section ...,
translation represents pushing that number at run-time (see translate-float). If c-addr u
has the syntax of a double number without prefix according to section
..., and it corresponds to the floating-point number r corresponding to
that string according to section ..., translation may represent
pushing r at run-time. If c-addr u is not recognized as a
floating-point number, translation is translate-none.
rec-none ( c-addr u -- translation )
This word does not recognize anything. For its translation, see
translate-none. (formerly known as notfound and r:fail)
recs ( -- )
Print the recognizers in the recognizer sequence in rec-forth, the
first searched recognizer leftmost. (formerly known as .recognizers)
rec-forth ( c-addr u -- translation )
This is a deferred word that contains the recognizer (sequence) that is used by the Forth text interpreter. (formerly forth-recognize)
rec-sequence: ( xtu .. xt1 u "name" -- )
Define a recognizer sequence "name" containing u recognizers
represented by their xts. If set-recs is implemented, the sequence
must be able to accommodate at least 16 recognizers.
name execution: ( c-addr u -- translation )
Execute xt1; if the resulting translation is the result of
translate-none, restore the data stack to ( c-addr u -- ) and try
the next xt. If there is no next xt, remove ( c-addr u -- ) and
perform translate-none.
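The execution rule above can be modelled in plain standard Forth. The following is only a sketch under stated assumptions: it does not use the proposed words, every name starting with my- is hypothetical, and it assumes a single-cell "none" token with the value 0, which the proposal does not require.

```forth
\ Hypothetical model of a recognizer sequence; not the proposed words.
0 constant my-none          \ assumption: the "none" translation is one cell
1 constant my-digit         \ hypothetical token for "one decimal digit"

: my-rec-empty ( c-addr u -- my-none )  \ recognizes nothing
  2drop my-none ;

: my-rec-digit ( c-addr u -- n my-digit | my-none )
  1 = if  c@ [char] 0 -  dup 10 u< if my-digit exit then  then
  drop my-none ;

create my-seq  ' my-rec-empty ,  ' my-rec-digit ,   \ searched in this order
2 constant my-seq-len

: my-rec-seq ( c-addr u -- ... token )
  my-seq-len 0 ?do
    my-seq i cells + @        \ fetch the xt before touching the return stack
    rot rot 2dup 2>r          \ save the string for the next recognizer
    rot execute
    dup my-none <> if  2r> 2drop unloop exit  then
    drop 2r>                  \ restore ( c-addr u ) and try the next xt
  loop
  2drop my-none ;             \ no recognizer matched
```

With these definitions, s" 7" my-rec-seq leaves 7 my-digit on the stack, and s" x" my-rec-seq leaves my-none.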
translate-none ( -- translation )
(formerly r:fail or notfound)
translation interpreting run-time: ( ... -- )
-13 throw
translation compiling run-time: ( ... -- )
-13 throw
translation postponing run-time: ( ... -- )
-13 throw
translate-cell ( x -- translation )
(formerly translate-num)
translation interpreting run-time: ( -- x )
translate-dcell ( xd -- translation )
(formerly translate-dnum)
translation interpreting run-time: ( -- xd )
translate-float ( r -- translation )
translation interpreting run-time: ( -- r )
translate-name ( nt -- translation )
(formerly translate-nt)
translation interpreting run-time: ( ... -- ... )
Perform the interpretation semantics of nt.
translation compiling run-time: ( ... -- ... )
Perform the compilation semantics of nt.
translate: ( xt-int xt-comp xt-post "name" -- )
Define "name" (formerly rectype:)
"name" execution: ( i*x -- translation )
translation interpreting run-time: ( ... translation -- ... )
Remove the top of stack (the translation token) and execute xt-int.
translation compiling run-time: ( ... translation -- ... )
Remove the top of stack (the translation token) and execute xt-comp.
translation postponing run-time: ( ... translation -- ... )
Remove the top of stack (the translation token) and execute xt-post.
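To make the three run-times concrete, here is a minimal model in standard Forth. It is a sketch, not the proposed implementation: the my- names are hypothetical, and representing the token as the address of a three-cell table is an assumption, since the proposal leaves the representation of a translation token open.

```forth
\ Hypothetical model of translate: in standard Forth; not the proposed word.
: my-translate: ( xt-int xt-comp xt-post "name" -- )
  create  rot , swap , , ;   \ table layout: int, comp, post
                             \ "name" pushes the table address as its token

: my-interpreting ( ... token -- ... )  @ execute ;
: my-compiling    ( ... token -- ... )  cell+ @ execute ;
: my-postponing   ( ... token -- ... )  2 cells + @ execute ;

: my-noop ( -- ) ;
: my-lit, ( x -- )  postpone literal ;   \ compile x as a literal

' my-noop                                        \ int: x stays on the stack
' my-lit,                                        \ comp: compile x
:noname ( x -- ) my-lit, ['] my-lit, compile, ;  \ post: compile "x my-lit,"
my-translate: my-cell

: five ( -- 5 )  [ 5 my-cell my-compiling ] ;
```

Here 7 my-cell my-interpreting leaves 7 on the stack, and five pushes 5 because the compiling action compiled the value as a literal. In each case the token is removed from the top of the stack before the corresponding xt runs, as in the run-time descriptions above.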
get-recs ( xt -- xt_u ... xt_1 u )
xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.
set-recs ( xt_u ... xt_1 u xt -- )
xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched first, and xt_u is searched last. Throw ... if u exceeds the number of elements supported by the recognizer sequence.
Rationale
(This will also be fleshed out)
The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, and such words are only used internally in the text interpreter.
However, to give an idea of how all this works together, here are the words that Gforth provides for that purpose:
interpreting ( ... translation -- ... )
For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.
compiling ( ... translation -- ... )
For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.
postponing ( ... translation -- )
For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.
Examples
s" 123" rec-forth ( translation ) interpreting
: rec-tick ( addr u -- translation )
over c@ '`' = if
1 /string find-name dup if
name>interpret ( xt ) translate-num then
exit then
\ 2drop notfound
rec-none ;
' noop ( x -- x ) \ int-xt
' lit, ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ; run-time: -- x ) \ post-xt
translate: translate-num
AntonErtl
New Version: Recognizer committee proposal 2025-09-11
The committee has found consensus on the following words. I was asked to write it up as a proposal, quickly. Due to time limits this is just a skeleton, and will not make sense to people new to the discussion. A more fleshed-out proposal will be submitted later.
Translations and text-interpretation
Recognizers produce translations. The text interpreter (and other
users, such as postpone) removes the translation from the stack(s)
and then performs either the interpreting run-time, the compiling
run-time, or the postponing run-time.
Unless otherwise specified, the compiling run-time compiles the interpreting run-time. The postponing run-time compiles the compiling run-time.
Types
translation: The result of a recognizer; the input of interpreting, compiling, and postponing; it's a semi-opaque type that consists of a translation token at the top of the data stack and additional data on various stacks below.
translation token: Single-cell item that identifies a certain translation. (This has formerly been called a rectype.)
Words
rec-name ( c-addr u -- translation )
If c-addr u is the name of a visible local or a visible named word, translation represents the text-interpretation semantics (interpreting, compiling, postponing) of that word (see
translate-name). If not, translation is translate-none.
(formerly called rec-nt)
rec-number ( c-addr u -- translation )
If c-addr u is a single or double number (without or with prefix), or
a character, all as described in section ..., translation represents
pushing that number at run-time (see translate-cell,
translate-dcell). If not, translation is translate-none.
rec-float ( c-addr u -- translation )
If c-addr u is a floating-point number, as described in section ...,
translation represents pushing that number at run-time (see
translate-float). If c-addr u has the syntax of a double number
without prefix according to section ..., and it corresponds to the
floating-point number r corresponding to that string according to
section ..., translation may represent pushing r at run-time. If
c-addr u is not recognized as a floating-point number, translation is
translate-none.
rec-none ( c-addr u -- translation )
This word does not recognize anything. For its translation, see
translate-none.
recs ( -- )
Print the recognizers in the recognizer sequence in rec-forth, the
first searched recognizer leftmost. (formerly known as .recognizers)
rec-forth ( c-addr u -- translation )
This is a deferred word that contains the recognizer (sequence) that is used by the Forth text interpreter. (formerly forth-recognize)
rec-sequence: ( xtu .. xt1 u "name" -- )
Define a recognizer sequence "name" containing u recognizers
represented by their xts. If set-recs is implemented, the sequence
must be able to accommodate at least 16 recognizers.
name execution: ( c-addr u -- translation )
Execute xt1; if the resulting translation is the result of
translate-none, restore the data stack to ( c-addr u -- ) and try
the next xt. If there is no next xt, remove ( c-addr u -- ) and
perform translate-none.
translate-none ( -- translation )
(formerly r:fail or notfound)
translation interpreting run-time: ( ... -- )
-13 throw
translation compiling run-time: ( ... -- )
-13 throw
translation postponing run-time: ( ... -- )
-13 throw
translate-cell ( x -- translation )
(formerly translate-num)
translation interpreting run-time: ( -- x )
translate-dcell ( xd -- translation )
(formerly translate-dnum)
translation interpreting run-time: ( -- xd )
translate-float ( r -- translation )
translation interpreting run-time: ( -- r )
translate-name ( nt -- translation )
(formerly translate-nt)
translation interpreting run-time: ( ... -- ... )
Perform the interpretation semantics of nt.
translation compiling run-time: ( ... -- ... )
Perform the compilation semantics of nt.
translate: ( xt-int xt-comp xt-post "name" -- )
Define "name" (formerly rectype:)
"name" execution: ( i*x -- translation )
translation interpreting run-time: ( ... translation -- ... )
Remove the top of stack (the translation token) and execute xt-int.
translation compiling run-time: ( ... translation -- ... )
Remove the top of stack (the translation token) and execute xt-comp.
translation postponing run-time: ( ... translation -- ... )
Remove the top of stack (the translation token) and execute xt-post.
get-recs ( xt -- xt_u ... xt_1 u )
xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.
set-recs ( xt_u ... xt_1 u xt -- )
xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched first, and xt_u is searched last. Throw ... if u exceeds the number of elements supported by the recognizer sequence.
Rationale
(This will also be fleshed out)
The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, and such words are only used internally in the text interpreter.
However, to give an idea of how all this works together, here are the words that Gforth provides for that purpose:
interpreting ( ... translation -- ... )
For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.
compiling ( ... translation -- ... )
For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.
postponing ( ... translation -- )
For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.
Examples
s" 123" rec-forth ( translation ) interpreting
: rec-tick ( addr u -- translation )
over c@ '`' = if
1 /string find-name dup if
name>interpret ( xt ) translate-num then
exit then
\ 2drop notfound
rec-none ;
' noop ( x -- x ) \ int-xt
' lit, ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ; run-time: -- x ) \ post-xt
translate: translate-num
EricBlake
I'm aware you intend to flesh this out further, but when doing so, the following observations may be useful.
Examples
s" 123" rec-forth ( translation ) interpreting
It might be helpful to also show the stack effect after interpreting, as in:
s" 123" rec-forth ( translation ) interpreting ( n ) \ leaves 123 on the stack
If I understand the intent correctly, the difference between rec-none and translate-none is that both produce the same translation (which consists of a translation token and possibly additional cells), while only the former also consumes addr u before pushing that translation. Taking it further, since translation is a semi-opaque type of one or more cells, I think I can still implement it where the translation token returned by translate-none is a literal 0 (trivially as 0 constant translate-none), provided that my interpreting/compiling/postponing all recognize a literal 0 as the built-in translation whose effects are to result in -13 throw. Or I could implement :noname -13 throw ; dup dup translate: translate-none where the translation token is a non-zero xt just like any other user-defined word created by translate:, and then interpreting/compiling/postponing don't have to special-case 0. But either way, my implementation choice for translate-none is not unduly constrained by the standard, and not relevant or visible to the user; but it DOES place constraints on the user writing their own recognizers to use rec-none or translate-none instead of hard-coding 0 in their code, if they don't want an environmental dependency on the implementation. I think I like how that turned out.
But given that analysis, it means that the standard should not prohibit a translation token from being a literal 0; when compared to the other recent work in r1533 to designate that xt => x \ flag, the standard should be clear that translation token => x and not translation token => xt.
Continuing with the examples,
: rec-tick ( addr u -- translation )
over c@ '`' = if
1 /string find-name dup if
name>interpret ( xt ) translate-num then
Why is this calling out translate-num instead of translate-cell?
exit then
This exit leaves the stack with either xt translate-num or 0; the former makes sense but the latter assumes translate-none produces a literal 0, which I just argued above is a specific implementation choice rather than something the standard mandates. Would it be better as:
... find-name ?dup if name>interpret ( xt ) translate-cell exit then translate-none exit ...
\ 2drop notfound
rec-none ;
I like how rec-none serves the same role as the former 2drop notfound, but does leaving in the comment aid in understanding the example?
' noop ( x -- x ) \ int-xt
' lit, ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ; run-time: -- x ) \ post-xt
translate: translate-num
Is this part of the example intended to be a potential reference implementation of translate-cell? Or is it intended to supply the translate-num used in the rec-tick above, in which case the order of presentation should be swapped?
ruv
translate-cell ( x -- translation )
translate-name ( nt -- translation )
The naming scheme translate-*** is inappropriate and confusing for these words; for example, the name translate-name implies that the word performs some translation, but this word actually does not perform any translation, it is just a constant (i.e., it simply pushes a single-cell token on the stack; and this should be indicated in its stack diagram).
We should find a better naming scheme for these words.
Possible options:
- ***-recognized
- ***-tag or tag-*** (because effectively this value is a data type tag for a data object)
- td-*** (from "token discriminator" or "token descriptor", similar to tag)
Other?
@EricBlake wrote:
I think I can still implement it where the translation token returned by translate-none is a literal 0
Yes, for example, in Gforth it is currently implemented this way.
A zero value on failure simplifies analysis: you can write «dup if ...» or «if ...» instead of «dup tag-none <> if ...». If most implementations stick to this approach, it can be standardized.
Why is this calling out translate-num instead of translate-cell?
This is probably an oversight after renamings.
EricBlake
it DOES place constraints on the user writing their own recognizers to use rec-none or translate-none instead of hard-coding 0 in their code, if they don't want an environmental dependency on the implementation. I think I like how that turned out.
But now in trying to code something up that uses recognizers from the user's point of view, I tried to write a quick word that accepts single-cell numbers but rejects character strings that would produce a double cell or not be recognized as a number. Using gforth's implementation, it might be as simple as:
: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
rec-number
case
translate-cell of ( n ) true exit endof
translate-none of ( -- ) 0 exit endof
translate-dcell of ( d ) 2drop 0 exit endof
abort" unexpected translation" endcase ;
But re-reading the proposed specification, translate-cell has a mandated stack effect of ( n -- translation ), and my usage above did not satisfy that requirement. So, what if I modify it along these lines, to ensure that every time translate-cell is used, there is an n on the stack beforehand (and then jump through hoops to get it back off the stack)?
: token-none ( -- token ) \ determine the token produced by translate-none
translate-none
;
: token-cell ( -- token ) \ determine the token produced by translate-cell
0 translate-cell dup >r interpreting drop r>
;
: token-dcell ( -- token ) \ determine the token produced by translate-dcell
#0. translate-dcell dup >r interpreting 2drop r>
;
: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
rec-number
case
token-cell of ( n ) true exit endof
token-none of ( -- ) 0 exit endof
token-dcell of ( d ) 2drop 0 exit endof
abort" unexpected translation" endcase ;
Alas, even that seems like it is not portable to the proposed spec, but has an environmental dependency on gforth's implementation. There's nothing in the proposed wording that requires the translation token cell to be identical regardless of what the rest of the overall translation represents. In fact, I see nothing that requires translate-none to produce a single cell, nor for any other given flavor of translation to occupy a consistent number of cells regardless of the value being translated. The proposal is very clear that 123 translate-cell interpreting will leave 123 on the stack, but intentionally does not state how many cells of the stack are in use in between translate-cell and interpreting (only that it is a semi-opaque type of one or more cells).
Put differently, gforth's implementation for translate-cell happens to be idempotent and produce a translation that occupies two cells (namely, the value of the cell being translated, and a single-cell translation token); but based on just the proposed spec, what would prevent an alternative implementation that has a translation occupy exactly one cell (namely, a pointer to an internal struct that wraps multiple pieces of information, including the value to push, the current line/offset of the source at the time the call to translate-cell was made in order to make for more friendly SEE output, and so on)? With such an implementation, I could argue that 0 translate-cell 0 translate-cell = producing false is acceptable, because it results in two different pointers (there were two different source locations at the time of the two different invocations of translate-cell). Or what would prevent an implementation where 0 translate-cell occupies one cell, because it is frequently encountered and worth special-casing in the interpreter loop, vs. 123 translate-cell occupying two cells, because it is infrequently encountered?
From the user's point of view, it would be a lot more powerful if we had a guarantee that a given translate-XXX produces the same translation token at the top of the stack (even if the rest of the stack is variable-length), and that a given rec-XXX produces idempotent output (maybe with limitations on how SOURCE and >IN can be changed between the recognizer and the action on the translation). It would also be nice if we could guarantee that ALL instances of translate-XXX have the behavior of pushing a single cell to the stack, where that cell is constant for a given translate-XXX, and document that comparing translation tokens is well-defined, and that the rest of the stack diagram for a translator only matters if the resulting translation will be further passed to interpreting/compiling/postponing (i.e., translate-cell drop is always unambiguous and cannot cause stack underflow, but it is ambiguous behavior to attempt translate-cell compiling if there was not an n on the stack).
Finally, it would also be nice if there were an easy way to discard the entire stack effect of a given translation, if the result of a recognizer produces a different translation than desired. Maybe discard ( translation -- ), so that I could rewrite my earlier example more compactly:
: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
rec-number
dup translate-cell = if ( n token-cell ) drop true exit then
( 0 | d token-dcell -- ) discard 0 ;
ruv
But re-reading the proposed specification, translate-cell has a mandated stack effect of
( n -- translation )
The idea is that translate-cell has stack effect ( -- x.td ), and ( x x.td ) is a translation. The name translate-cell with its stack diagram is very confusing. I think this proposal needed more work before publication.
See also my suggestion for naming these data types.
ruv
it would also be nice if there were an easy way to discard the entire stack effect of a given translation, if the result of a recognizer produces a different translation than desired. Maybe
discard ( translation -- )
Yes, it would be useful and convenient. But then we should associate more information with data type identifiers. I think I would implement this.
In this proposal, "translation" (yes, the name is inappropriate) in stack diagrams is a data type symbol for the data type that is a subtype of ( ut td ), which is a pair of ut (unqualified token) and td (token descriptor) at the top, where:
- td => x, i.e., the token descriptor (a data type) is a subtype of the unspecified cell;
- ut => ( F: j*r ; S: i*x ; ), i.e., the unqualified token (a data type) is an arbitrary tuple (possibly empty) of r and x values, so that r values reside on the floating point stack, x values reside on the data stack.
In the language of type theories, ( ut td ) is a dependent pair type, since each member of td is associated with some particular subtype of ut. In other words, the value of the stack parameter td determines the type of the stack parameter ut. Therefore, a member of the type ( ut td ) can be interpreted as a tagged data object, in which the value td is a tag for the value ut.
Thus, each member of td is also an identifier of some data type. The Forth system associates with this identifier information about how to translate (interpret or compile) the members of the corresponding data type (a subtype of ut). Of course, this identifier may also be associated with information about how to remove the members of this data type from the stacks.
Since the user can effectively define their own data types, we should provide a way to create a token descriptor (a member of td) and associate various information with it in several steps. The main piece of information is about translation of the data type members. Information about postponing and discarding (removing from the stacks) may be optional.
Regarding terminology/naming. We can use the term "data type identifier" instead of "token descriptor", but 1) this name is longer, 2) then there will be a number of terms that look very similar: "data type", "data type symbol", "data type identifier". Therefore, I would prefer more distinguishable terms.
As an option, instead of "token descriptor" we can use "type descriptor".
An alternative solution to remove a qualified token from the stack is to determine its size. This can be done by storing the stack depth before the qualified token is placed on the stack and then calculating the change in stack depth. For example, see the word apply-recognizer-filter in my recognizer/filter.fth and its use in the word available-xt in example.text-translator.fth.
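The depth-based idea can be sketched in standard Forth as follows. This is an illustration only and is not taken from the files mentioned above: the names drop-to-depth and discard-result are hypothetical, and the sketch handles the data stack only, so any items a recognizer leaves on the floating-point stack are not removed.

```forth
\ Hypothetical helpers; data stack only, the floating-point stack is ignored.
: drop-to-depth ( i*x n -- j*x )   \ drop cells until n cells remain below n
  begin  depth 1- over u>  while  nip  repeat  drop ;

: discard-result ( c-addr u xt -- )  \ run recognizer xt, discard its result
  depth 3 - >r  execute  r> drop-to-depth ;
```

The stack depth below c-addr u xt is recorded before the recognizer runs, so whatever number of cells the recognizer leaves is dropped afterwards without knowing the size of the result in advance.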
AntonErtl
The mentions of translate-num in the examples are oversights and should be translate-cell. The way that the example rec-tick deals with the case where the word is not found does not work with a non-zero translate-none. A correct implementation is:
: rec-tick ( addr u -- translation ) \ gforth-experimental
over c@ '`' = if
1 /string find-name dup if
name>interpret translate-cell exit then
drop translate-none exit then
rec-none ;
EricBlake
: rec-tick ( addr u -- translation ) \ gforth-experimental
over c@ '`' = if
The \ gforth-experimental comment can be dropped. Is a recognizer guaranteed that u will be non-zero, or is this c@ at risk of reading beyond the bounds of the input argument? And if the recognizer is called on the length-1 string "`", should this example be relying on the implementation-defined results of c-addr 0 find-name (likely 0, but possibly an xt if the implementation allows for an empty-length dictionary entry)?
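One possible way to avoid both concerns is to check the length before fetching the first character. The helper below is a sketch in standard Forth with a hypothetical name; rejecting a lone backquote is a design choice of this sketch, not something the proposal specifies.

```forth
: tick-prefixed? ( c-addr u -- flag )
  \ true only if the string is a backquote followed by at least one char
  1 > if  c@ [char] ` =  else  drop false  then ;
```

A recognizer like rec-tick could test tick-prefixed? on a copy of c-addr u first, and fall through to rec-none when the flag is false, so that c@ is never applied to an empty string and find-name is never called with a zero-length name.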
ruv
I think we should fix the following problems.
The term "translation"
The term translation is not suitable to denote the general type of a recognizer's result, since "translation" is either an act of translating or a product of translating (not recognizing). Even the term "recognition" is more suitable, if someone likes it.
Another possible option: "recognized", which will be used as a nominalized adjective (i.e., a noun).
We also need a separate term to denote the type of the topmost x value of a successful recognizing result.
The scheme translate-something
The naming scheme translate-something is not suitable for words that have type ( -- x ) and are constants.
- Effectively, any member of this naming scheme is a verb phrase; this scheme was intended for words that perform translation (interpretation or compilation), which is an active action with possible side effects.
  - For example, translate-nt ( i*x nt -- j*x ).
A word that is a constant should have the name that is a noun or a noun phrase.
This naming scheme should be aligned with the corresponding general data type name/symbol.
The names get-recs and set-recs
The pair of words ( get-recs, set-recs ) is similar to the pair of standard words ( get-order, set-order ) by the form of their names, but very different conceptually, since they accept the object on the top. This is an inconsistency in naming conventions.
Better naming options are:
- recs@ and recs!
  - "recs" in these names denotes the pair of types at once:
    - the type of a data object that is fetched or stored
    - the type of a target data object
  - it is also similar to some extent to the pairs of standard words ( defer@, defer! ), ( c@, c! ), ( 2@, 2! )
  - see also my post on ForthHub in this regard.
- fetch-recs and store-recs
The names translate: and rec-sequence:
The corresponding words are proposed as defining words.
Traditionally, a colon was only used in the names of standard defining words that have a counterpart word with a semi-colon in the name.
So, this name is inconsistent with other names. Note that this tradition was broken by the new "*field:" words (but not +field).
- Can we avoid a colon in the defining words that don't have a counterpart word with a semicolon?
The name rec-sequence: is too close to rec-sequence that is a member of the rec-something naming scheme. This is inconsistent and confusing.
- A possible option: recs, an abbreviation of "recognizers sequence", which is "sequence of recognizers".
  - Maybe it is better if this word were like wordlist, which produces a new identifier on the stack without creating a word.
Josef
I agree with @ruv that "translation" doesn't quite fit and finding suitable terms is a real challenge. I utilized this proposal, BerndPaysan's retired recognizer proposal, FORTH Inc.'s recognizer page, and the comments here and on the mailing list.
Suggestions short summary
Remove the "translation" term because it obfuscates the possible outputs, explained below.
Because a "translation token" is a table of run-time actions, "run-time action table" (rat) would seem appropriate, explained below.
A recognizer definition is proposed below.
It doesn't seem that the connection between a recognizer's pattern and the rest of the steps is really discussed. Matching a text token to the pattern is the first step. The parameters are then fetched according to the recognizer's pattern. The run-time action table is associated with a specific pattern parameter.
Recognizer term
From this proposal and FORTH, Inc.'s write up, the following seems to be how to design a recognizer:
- Determine the text pattern of the data.
  - E.g. complex numbers follow an "a+bi" pattern.
- Create a pattern matching algorithm for the text pattern.
- Determine pattern parameters to be fetched.
- Determine the run-time action tables to pair with the fetched pattern parameters.
Recognizer definition proposal: A recognizer attempts to match a text token to a pattern. A successful text token match invokes fetching the pattern parameters and the associated run-time action table. A failed matching attempt outputs a rat-none. The text interpreter (and other users, such as postpone) utilizes the run-time action table to perform either the interpreting run-time, compiling run-time, or postponing run-time.
make-rec-sequence ( xtu .. xt1 u "name" -- )rec-sequence:
rec-name ( c-addr u -- xt rat | rat-none)
rec-num ( c-addr u -- i*x rat | rat-none)
rec-none ( c-addr u -- rat-none )
I agree with @ruv's suggestions for get-recs and set-recs, i.e. recs@ & recs!.
Translation Term
Translation seems to hide information. The relationship between the pattern parameters and the run-time action table is fixed. Because different recognizers produce different outputs, using "translation" as a catchall obscures the output, rather than listing the output: i*x rat, xt rat, etc.
run-time action table (replaces translation token): Single-cell item that contains the run-time actions associated with specific pattern parameters, i.e. interpreting run-time, compiling run-time, and postponing run-time. (This has formerly been called a rectype, translation token. It's a table of run-time actions.)
make-rat ( xt-int xt-comp xt-post "name" -- ) (replaces translate:)
rat-word ( -- rat ) (replaces translate-word)
pattern parameters: The optional set of data fetched after a successful text token match. The set is on various stacks below the run-time action table. (This could use a better name, not sure if it's really needed, but it helped my thinking.)
I walk through the examples below with the notes above.
Example: REC-NAME
FORTH, Inc. has this example:
' EXECUTE ' COMPILE, ' POSTPONE, TRANSLATE: TRANSLATE-WORD
' EXECUTE ' EXECUTE ' COMPILE, TRANSLATE: TRANSLATE-IMM
: REC-NAME ( c-addr len -- xt addr1 | addr2 )
(FIND) CASE
-1 OF TRANSLATE-WORD ENDOF
1 OF TRANSLATE-IMM ENDOF
0 OF TRANSLATE-NONE ENDOF
ENDCASE ;
Compared to the steps above:
- Data to be handled is "words in general".
- The pattern is a word that is in the dictionary.
- The pattern parameters fetched could be:
  - xt 1
  - xt -1
  - c-addr 0
- Pattern parameters are associated to rats as follows (matching the CASE branches in the code above):
  - -1 to TRANSLATE-WORD
  - 1 to TRANSLATE-IMM
  - 0 to TRANSLATE-NONE (originally, NOTFOUND).
(FIND) completes both Steps 2 & 3. The rat output is based on the pattern parameters fetched, not the pattern being matched.
Example: REC-TICK
From the proposal:
: rec-tick ( addr u -- translation ) \ gforth-experimental
over c@ '`' = if
1 /string find-name dup if
name>interpret translate-cell exit then
drop translate-none exit then
rec-none ;
Walking through the steps:
- The data to be handled is a ticked word.
- The pattern is a name in the dictionary.
- The pattern parameters fetched by 1 /string find-name could be:
  - nt
  - 0
- Pattern parameters are associated to rats as follows:
  - nt to translate-cell
  - 0 to translate-none
The rat output is based on the pattern parameters fetched, not the pattern being matched.
2drop translate-none seems clearer than rec-none. I keep getting caught looking at the rec-tick example thinking "what is rec-none recognizing?"
Example Observations
- Pattern matching and pattern parameter fetching can be combined or separate words.
- It would be reasonable to have a failed pattern parameter fetch be an error. A pitfall of creating recognizers is ensuring there is little to no overlap of patterns.
  - E.g. 'bob is the name as defined, processed by rec-name. 'stan is the ticked version of stan, processed by rec-tick.
  - E.g. rec-none could be the final recognizer in recognizer sequences, exiting any further evaluation. Instead of creating a new sequence, one could move rec-none earlier in the sequence.
Thank you for reading this far, hopefully there is more food for thought, than madness.
ruv
@Josef wrote:
I agree with @ruv that "translation" doesn't quite fit and finding suitable terms is a real challenge.
Because a "translation token" is a table of run-time actions, "run-time action table" (rat) would seem appropriate, explained below.
I'm making one more attempt on this matter.
The language of the Standard already uses concepts such as data object, data type, typed data object, and subtyping (see 3.1 Data types).
Using these concepts, we can describe a successful recognition result as a pair consisting of a data object and its corresponding data type.
On the stack, data types must be represented by specific identifiers,
similar to how semantics elements are represented by xt identifiers.
We might refer to such an identifier as a type descriptor (symbol td).
- Note: "type descriptor" is preferred over "type identifier" because, in the language of the Standard, we will need expressions like "type descriptor td identifies ...". Using "type identifier" would lead to awkward repetitions such as "type identifier ti identifies ...".
- Another option for this term could be "type token" (seems less preferable).
Additionally, we might define
a qualified data object (symbol qdo)
as a pair consisting of a data object
and the type descriptor that identifies that object's data type.
- Note. This concept should be distinguished from the existing concept of a "typed data object".
The elegance and strength of this approach lie in the following points:
- It builds upon existing terminology, with only slight extensions.
- It incorporates existing data type symbols into naming conventions.
- It leverages subtyping relationships between data types to reduce redundancy (adhering to the DRY principle).
Type descriptors can be used to:
- Translate data objects (into the body of a Forth definition when compiling or side effects when interpreting).
- Convert data objects to different data types (casting).
- E.g., getting xt from nt (for example, of an ordinary word only)
- Check subtyping relationships between data types (or of a qualified data object).
- Define new type descriptors.
These features can be designed independently of recognizers, and recognizers only rely on them when returning a qualified data object or analyzing a qualified data object from another recognizer.
AntonErtl
New Version: Recognizer committee proposal 2025-09-11
Recognizer committee proposal 2025-09-11
The committee has found consensus on the words in this proposal. I was asked to write it up.
Author:
M. Anton Ertl (based on previous work by Matthias Trute, Bernd Paysan, and others, and the input of the standardization committee).
Change Log:
- 2026-02-08 Fleshed out proposal; worked in feedback up to now.
- 2025-09-12 [r1535] Some fixes
- 2025-09-12 [412] Initial version
Problem:
The classical text interpreter is inflexible: E.g., adding floating-point recognizers requires hardcoding the change; several systems include system-specific hooks (sometimes more than one) for plugging in functionality at various places in the text interpreter.
The difficulty of adding to the text interpreter may also have led to
missed opportunities: E.g., for string literals the standard did not
task the text interpreter with recognizing them, but instead
introduced S" and S\" (and their complicated definition with
interpretation and compilation semantics).
Solution:
The recognizer proposal is a factorization of the central part of the text interpreter.
As before the text interpreter parses a white-space-delimited string.
Unlike before, the string is now passed to the recognizers in the
default recognizer sequence rec-forth, one recognizer after another,
until one matches. The result of the matching recognizer is a
translation, an on-stack representation of the word or literal. The
translation is then processed according to the text-interpreter's
state (interpreting, compiling, postponing).
There are five usage levels of recognizers and related recognizer words:
1. Programs that use the default recognizers. This is the common case, and is essentially like using the traditional hard-coded Forth text interpreter. You do not need to use recognizer words for this level, but you can inform yourself about the recognizers in the current default recognizer sequence with recs. The default recognizer sequence contains at least rec-name and rec-number, and, if the Floating-Point wordset is present, rec-float. Moreover, programmers can now postpone numbers and other recognized things.
2. Programs that change which of the existing recognizers are used and in what order. The default recognizer sequence is rec-forth. You can get the recognizers in it with get-recs and set them with set-recs. You can also create a recognizer sequence (which is a recognizer itself) with rec-sequence:. This proposal contains pre-defined recognizers rec-name rec-number rec-float rec-none, which can be used with set-recs or for defining a recognizer sequence.
3. Programs that define new recognizers that use existing translation tokens. New recognizers are usually colon definitions, proposed-standard translation tokens are translate-none translate-cell translate-dcell translate-float translate-name.
4. Programs that define new translation tokens. New translation tokens are defined with translate:.
5. Programs that define text interpreters and programming tools that have to deal with recognizers. Words for achieving that are not defined in this proposal, but discussed in the rationale.
See the rationale for more detail and answers to specific questions.
Reference implementation:
TBD.
Testing:
TBD.
Proposal:
Usage requirements:
Data Types
translation: The result of a recognizer; the input of
interpreting, compiling, and postponing; it's a semi-opaque type
that consists of a translation token at the top of the data stack and
additional data on various stacks below.
translation token: (This has formerly been called a rectype.) Single-cell item that identifies a certain translation.
Translations and text-interpretation
A recognizer pushes a translation on the stack(s). The text interpreter
(and other users, such as postpone) removes the translation from the
stack(s), and then either performs the interpreting run-time,
compiling run-time, or postponing run-time.
All the proposed-standard translate-... words only push a
translation token. Their stack effects are specified as expecting
some data on the stack and pushing a translation. This shows what
data is required in addition to the translation token to form a
complete translation. A proposed-standard translate-... word pushes
the same translation token every time it is invoked.
Compiling and postponing run-time
Unless otherwise specified, the compiling run-time compiles the
interpreting run-time. The postponing run-time compiles the compiling run-time.
Exceptions
Add the following exception to table 9.1:
-80 too many recognizers
Words
rec-name ( c-addr u -- translation )
(formerly rec-nt)
If c-addr u is the name of a visible local or a visible named word,
translation represents the text-interpretation semantics
(interpreting, compiling, postponing) of that word (see
translate-name). If not, translation is translate-none.
rec-number ( c-addr u -- translation )
(formerly rec-num)
If c-addr u is a single or double number (without or with prefix), or
a character, all as described in section 3.4.1.3 (Text interpreter
input number conversion), translation represents pushing that number
at run-time (see translate-cell, translate-dcell). If not,
translation is translate-none.
rec-float ( c-addr u -- translation )
If c-addr u is a floating-point number, as described in section 12.3.7
(Text interpreter input number conversion), translation represents
pushing that number at run-time (see translate-float). If c-addr u
has the syntax of a double number without prefix according to section
8.3.1 (Text interpreter input number conversion), and it corresponds
to the floating-point number r according to section 12.6.1.0558
(>float), translation may represent pushing r at run-time. If
c-addr u is not recognized as a floating-point number, translation is
translate-none.
rec-none ( c-addr u -- translation )
This word does not recognize anything. For its translation, see
translate-none.
recs ( -- )
(formerly .recognizers)
Print the recognizers in the recognizer sequence in rec-forth, the
first searched recognizer leftmost.
rec-forth ( c-addr u -- translation )
(formerly forth-recognize) This is a deferred word that contains the
recognizer (sequence) that is used by the Forth text interpreter.
rec-sequence: ( xtu .. xt1 u "name" -- )
Define a recognizer sequence "name" containing u recognizers
represented by their xts. If set-recs is implemented, the sequence
must be able to accommodate at least 16 recognizers.
name execution: ( c-addr u -- translation )
Execute xt1; if the resulting translation is the result of
translate-none, restore the data stack to ( c-addr u -- ) and try
the next xt. If there is no next xt, remove ( c-addr u -- ) and
perform translate-none.
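The try-in-order behaviour described above can be modelled in standard Forth. The following is only a sketch of the semantics, not the reference implementation: every name in it (my-none, my-cell, rec-str, try-recs, rec-never, rec-udec, my-seq, my-recognize) is a hypothetical stand-in, and plain constants take the place of real translation tokens.

```forth
\ Hypothetical model of a recognizer sequence in standard Forth.
0 constant my-none          \ stand-in for the translate-none token
1 constant my-cell          \ stand-in for the translate-cell token

2variable rec-str           \ the string currently being recognized

: try-recs ( c-addr u addr n -- i*x token )
  \ addr holds n recognizer xts; the xt stored at addr is searched first
  2swap rec-str 2!
  cells over + swap ?do
    rec-str 2@ i @ execute            \ every recognizer sees a fresh ( c-addr u )
    dup my-none <> if unloop exit then
    drop                              \ no match: drop the token, try the next xt
  1 cells +loop
  my-none ;                           \ no recognizer matched

: rec-never ( c-addr u -- my-none )  2drop my-none ;   \ plays the role of rec-none

: rec-udec ( c-addr u -- x my-cell | my-none )
  \ matches non-empty strings that consist only of digits in the current base
  dup 0= if 2drop my-none exit then
  0 0 2swap >number                   \ ( ud c-addr' u' )
  nip if 2drop my-none else drop my-cell then ;

create my-seq  ' rec-never ,  ' rec-udec ,
: my-recognize ( c-addr u -- i*x token )  my-seq 2 try-recs ;
```

With base set to decimal, s" 42" my-recognize leaves 42 my-cell, while s" forty" my-recognize leaves only my-none.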
translate-none ( -- translation )
(formerly r:fail or notfound)
translation interpreting run-time: ( ... -- )
-13 throw
translation compiling run-time: ( ... -- )
-13 throw
translation postponing run-time: ( ... -- )
-13 throw
translate-cell ( x -- translation )
(formerly translate-num)
translation interpreting run-time: ( -- x )
translate-dcell ( xd -- translation )
(formerly translate-dnum)
translation interpreting run-time: ( -- xd )
translate-float ( r -- translation )
translation interpreting run-time: ( -- r )
translate-name ( nt -- translation )
(formerly translate-nt)
translation interpreting run-time: ( ... -- ... )
Perform the interpretation semantics of nt.
translation compiling run-time: ( ... -- ... )
Perform the compilation semantics of nt.
translate: ( xt-int xt-comp xt-post "name" -- )
(formerly rectype:)
Define "name"
"name" execution: ( i*x -- translation )
"name" interpreting action: ( ... translation -- ... )
Remove the top of stack (the translation token) and execute xt-int.
"name" compiling action: ( ... translation -- ... )
Remove the top of stack (the translation token) and execute xt-comp.
"name" postponing action: ( translation -- )
Remove the top of stack (the translation token) and execute xt-post.
get-recs ( xt -- xt_u ... xt_1 u )
xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.
set-recs ( xt_u ... xt_1 u xt -- )
xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched
first, and xt_u is searched last. Throw -80 (too many recognizers) if u exceeds the number of elements supported by the recognizer sequence.
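As an illustration of how get-recs and set-recs combine, the following sketch adds a recognizer to an existing sequence. It assumes a system that provides the words of this proposal; append-rec, prepend-rec and my-recs are hypothetical names chosen for this example.

```forth
\ Sketch; assumes the words of this proposal are available.
: append-rec ( xt-rec xt-seq -- )
  \ xt-rec becomes the last recognizer searched by the sequence xt-seq
  >r  r@ get-recs 1+      \ xt-rec already sits below xt_u, so only u changes
  r> set-recs ;

: prepend-rec ( xt-rec xt-seq -- )
  \ xt-rec becomes the first recognizer searched by the sequence xt-seq
  2>r  r@ get-recs 1+     \ ( xt_u .. xt_1 u+1 ) ( R: xt-rec xt-seq )
  2r> >r swap r>          \ ( xt_u .. xt_1 xt-rec u+1 xt-seq )
  set-recs ;

' rec-number ' rec-name 2 rec-sequence: my-recs   \ rec-name is searched first
' rec-none ' my-recs append-rec                   \ now: rec-name rec-number rec-none
```

Both words pass on the -80 throw of set-recs if the sequence cannot hold one more recognizer.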
postpone
Interpretation:
Interpretation semantics for this word are undefined.
Compilation: ( "<spaces>name" -- )
Skip leading space delimiters. Parse name delimited by a space.
Use rec-forth to recognize name, resulting in translation with
translation-token. For a system-defined translation token, first
consume the translation, then compile the 'compiling' run-time. For a
user-defined translation token, remove it from the stack and execute
its post-xt.
Rationale
(This will also be fleshed out)
Names
The names of terms and the proposed Forth words in this proposal have been arrived at after several lengthy discussions in the committee. Experience tells me that many readers (including from the committee) will take issue with one or the other name, but any suggestion for changing names will be ignored by me. If you want them changed, petition the committee (but I hope they will be as weary of renamings as I am).
In particular, I suggested to use "recognized" instead of
"translation", and IIRC also to rename the translate-... words
accordingly, but the committee eventually decided to stay with
translation and translate-....
Face it: The names are good enough. Any renaming, even if it results in a better name, increases the confusion more than it helps: even committee members (culprits in the renaming game themselves) have complained about being confused by the new, possibly better names for concepts and words that have already been present in Matthias Trute's proposal.
If you want to improve the proposal, please read it, play with the words in Gforth, read the reference implementation and the tests when they arrive, and point out any mistake or lack of clarity.
Translation tokens and translate-... words
[r1541]
points out interesting uses of knowledge about translation tokens,
and, conflictingly, potential implementation variations. This
proposal decides against the implementation variations and for the
uses by specifying in the Usage Requirements that a translate-...
word just pushes a translation token, and it always pushes the same
one.
Discarding a translation
[r1541]
also asks for a way to discard a translation. This need has also come
up in some recognizers implemented in Gforth (e.g., rec-tick), and
Gforth uses (non-standard) words like sp@ and sp! for that.
Standard options would be to wrap the word that pushes a translation
into catch and discard the stacks with a non-zero throw, or to use
depth and fdepth in combination with loops of drop and fdrop;
both ways are cumbersome. My feeling is that many in the committee
and in the wider Forth community do not see the need for
discard-translation yet; this may change in the future.
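The catch-based option mentioned above could look like the following sketch in standard Forth (it needs the Exception word set). The names run-then-throw and recognize-and-discard are hypothetical, and 1 is an arbitrarily chosen throw code; this is not a proposed definition of discard-translation.

```forth
\ Sketch of discarding a translation with catch and a non-zero throw.
: run-then-throw ( c-addr u xt -- )
  execute          \ the recognizer pushes a translation of unknown size
  1 throw ;        \ unwind: throw restores the stack depths saved by catch

: recognize-and-discard ( c-addr u xt -- )
  ['] run-then-throw catch  ( x1 x2 x3 n )  \ depth as before catch, plus the throw code
  dup 1 <> if throw then    \ pass on genuine errors raised by the recognizer
  drop                      \ our own throw code
  drop 2drop ;              \ the now-undefined slots of xt and c-addr u
```

The three extra drops at the end are part of what makes this approach cumbersome: catch restores the stack depth, but the cells that held c-addr u and xt come back with undefined contents and still have to be removed by hand.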
Consumers of translations (Usage level 5)
The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, so such words are only used internally in the text interpreter, eliminating the need to standardize them.
However, to give an idea how all this works together, here are the words that Gforth provides for that purpose:
interpreting ( ... translation -- ... )
For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.
compiling ( ... translation -- ... )
For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.
postponing ( ... translation -- )
For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.
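To show how these consumers fit together with rec-forth, here is a minimal sketch of a loop that text-interprets the rest of the current line. It assumes Gforth's non-standard interpreting and compiling as described above; my-interpret is a hypothetical name, and the postponing case is left out because state does not represent it.

```forth
\ Sketch; assumes rec-forth plus Gforth's (non-standard) interpreting and compiling.
: my-interpret ( ... "rest of line" -- ... )
  begin
    parse-name dup            \ ( c-addr u u ); u=0 means the parse area is exhausted
  while
    rec-forth                 \ ( ... translation )
    state @ if compiling else interpreting then
  repeat
  2drop ;                     \ drop the empty string returned by the last parse-name
```

On such a system, the line my-interpret 1 2 + would leave 3 on the stack; an unrecognized string ends in the -13 throw of translate-none.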
Examples
Typical use:
s" 123" rec-forth ( translation ) interpreting ( n ) \ leaves 123 on the stack
: umin ( u1 u2 -- u ) 2dup u< if drop else nip then ;
: string-prefix? ( c-addr1 u1 c-addr2 u2 -- f ) tuck 2>r umin 2r> compare 0= ;
: rec-tick ( addr u -- translation )
  2dup "`" string-prefix? if
    1 /string find-name dup if
      name>interpret translate-cell
    else
      drop translate-none then
    exit then
  \ this recognizer did not recognize anything, therefore:
  rec-none ;
' noop ( x -- x ) \ int-xt
' lit, ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ; run-time: -- x ) \ post-xt
translate: translate-cell
AntonErtl
New Version: Recognizer committee proposal 2025-09-11
Recognizer committee proposal 2025-09-11
The committee has found consensus on the words in this proposal. I was asked to write it up.
Author:
M. Anton Ertl (based on previous work by Matthias Trute, Bernd Paysan, and others, and the input of the standardization committee).
Change Log:
- 2026-02-09 Specify the translation tokens of the rec-... words. Also provide ( -- translation-token ) stack effects for translate-... words.
- 2026-02-08 [r1614] Fleshed out proposal; worked in feedback up to now.
- 2025-09-12 [r1535] Some fixes
- 2025-09-12 [412] Initial version
Problem:
The classical text interpreter is inflexible: E.g., adding floating-point recognizers requires hardcoding the change; several systems include system-specific hooks (sometimes more than one) for plugging in functionality at various places in the text interpreter.
The difficulty of adding to the text interpreter may also have led to
missed opportunities: E.g., for string literals the standard did not
task the text interpreter with recognizing them, but instead
introduced S" and S\" (and their complicated definition with
interpretation and compilation semantics).
Solution:
The recognizer proposal is a factorization of the central part of the text interpreter.
As before the text interpreter parses a white-space-delimited string.
Unlike before, the string is now passed to the recognizers in the
default recognizer sequence rec-forth, one recognizer after another,
until one matches. The result of the matching recognizer is a
translation, an on-stack representation of the word or literal. The
translation is then processed according to the text-interpreter's
state (interpreting, compiling, postponing).
There are five usage levels of recognizers and related recognizer words:
1. Programs that use the default recognizers. This is the common case, and is essentially like using the traditional hard-coded Forth text interpreter. You do not need to use recognizer words for this level, but you can inform yourself about the recognizers in the current default recognizer sequence with recs. The default recognizer sequence contains at least rec-name and rec-number, and, if the Floating-Point wordset is present, rec-float. Moreover, programmers can now postpone numbers and other recognized things.
2. Programs that change which of the existing recognizers are used and in what order. The default recognizer sequence is rec-forth. You can get the recognizers in it with get-recs and set them with set-recs. You can also create a recognizer sequence (which is a recognizer itself) with rec-sequence:. This proposal contains pre-defined recognizers rec-name rec-number rec-float rec-none, which can be used with set-recs or for defining a recognizer sequence.
3. Programs that define new recognizers that use existing translation tokens. New recognizers are usually colon definitions, proposed-standard translation tokens are translate-none translate-cell translate-dcell translate-float translate-name.
4. Programs that define new translation tokens. New translation tokens are defined with translate:.
5. Programs that define text interpreters and programming tools that have to deal with recognizers. Words for achieving that are not defined in this proposal, but discussed in the rationale.
See the rationale for more detail and answers to specific questions.
Reference implementation:
TBD.
Testing:
TBD.
Proposal:
Usage requirements:
Data Types
translation: The result of a recognizer; the input of
interpreting, compiling, and postponing; it's a semi-opaque type
that consists of a translation token at the top of the data stack and
additional data on various stacks.
translation token: (This has formerly been called a rectype.)
Single-cell item that identifies a certain kind of translation.
Translations and text-interpretation
A recognizer pushes a translation on the stack(s). The text interpreter
(and other users, such as postpone) removes the translation from the
stack(s), and then either performs the interpreting run-time,
compiling run-time, or postponing run-time.
All the proposed-standard translate-... words only push a translation token, and that stack effect is given, but in addition the definitions of these words also show a "Stack effect to produce a translation"; this stack effect points out which additional stack items need to be pushed before the translation token in order to produce a translation.
A proposed-standard translate-... word pushes the same translation
token every time it is invoked.
Compiling and postponing run-time
Unless otherwise specified, the compiling run-time compiles the interpreting run-time. The postponing run-time compiles the compiling run-time.
Exceptions
Add the following exception to table 9.1:
-80 too many recognizers
Words
rec-name ( c-addr u -- translation )
(formerly rec-nt)
If c-addr u is the name of a visible local or a visible named word, translation represents the text-interpretation semantics
(interpreting, compiling, postponing) of that word, and has the
translation token translate-name. If not, translation is
translate-none.
rec-number ( c-addr u -- translation )
(formerly rec-num) If c-addr u is a single-cell or double-cell
number (without or with prefix), or a character, all as described in
section 3.4.1.3 (Text interpreter input number conversion),
translation represents pushing that number at run-time. If a
single-cell number is recognized, the translation token of translation
is translate-cell, for a double cell translate-dcell. If neither
is recognized, translation is translate-none.
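Because each translate-... word pushes the same translation token every time, a program can inspect the result of rec-number by comparing the token. The following is a sketch under the assumption that the words of this proposal are available; number-kind. is a hypothetical name.

```forth
\ Sketch; assumes the words of this proposal are available.
: number-kind. ( c-addr u -- )
  rec-number
  dup translate-cell  = if drop ." single-cell: " .  exit then
  dup translate-dcell = if drop ." double-cell: " d. exit then
  drop ." not a number" ;     \ per the specification above, this is translate-none
```

For example, s" 123" number-kind. takes the single-cell branch and s" 123." number-kind. takes the double-cell branch.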
rec-float ( c-addr u -- translation )
If c-addr u is a floating-point number, as described in section 12.3.7
(Text interpreter input number conversion), rec-float recognizes it
as floating-point number r. If c-addr u has the syntax of a double
number without prefix according to section 8.3.1 (Text interpreter
input number conversion), and it corresponds to the floating-point
number r according to section 12.6.1.0558 (>float), rec-float may
(but is not required to) recognize it as a floating-point number. If
rec-float recognized c-addr u as floating-point number, translation
represents pushing that number at run-time, and the translation token
is translate-float. If c-addr u is not recognized as a
floating-point number, translation is translate-none.
rec-none ( c-addr u -- translation )
This word does not recognize anything. Its translation and
translation token is translate-none.
recs ( -- )
(formerly .recognizers)
Print the recognizers in the recognizer sequence in rec-forth, the
first searched recognizer leftmost.
rec-forth ( c-addr u -- translation )
(formerly forth-recognize) This is a deferred word that contains the
recognizer (sequence) that is used by the Forth text interpreter.
rec-sequence: ( xtu .. xt1 u "name" -- )
Define a recognizer sequence "name" containing u recognizers
represented by their xts. If set-recs is implemented, the sequence
must be able to accommodate at least 16 recognizers.
name execution: ( c-addr u -- translation )
Execute xt1; if the resulting translation is the result of
translate-none, restore the data stack to ( c-addr u -- ) and try
the next xt. If there is no next xt, remove ( c-addr u -- ) and
perform translate-none.
get-recs ( xt -- xt_u ... xt_1 u )
xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.
set-recs ( xt_u ... xt_1 u xt -- )
xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched first, and xt_u is searched last. Throw -80 (too many recognizers) if u exceeds the number of elements supported by the recognizer sequence.
translate-none ( -- translation-token )
(formerly r:fail or notfound)
Stack effect to produce a translation: ( -- translation )
translation interpreting run-time: ( ... -- )
-13 throw
translation compiling run-time: ( ... -- )
-13 throw
translation postponing run-time: ( ... -- )
-13 throw
translate-cell ( -- translation-token )
(formerly translate-num)
Stack effect to produce a translation: ( x -- translation )
translation interpreting run-time: ( -- x )
translate-dcell ( -- translation-token )
(formerly translate-dnum)
Stack effect to produce a translation: ( xd -- translation )
translation interpreting run-time: ( -- xd )
translate-float ( -- translation-token )
Stack effect to produce a translation: ( r -- translation )
translation interpreting run-time: ( -- r )
translate-name ( -- translation-token )
(formerly translate-nt)
Stack effect to produce a translation: ( nt -- translation )
translation interpreting run-time: ( ... -- ... )
Perform the interpretation semantics of nt.
translation compiling run-time: ( ... -- ... )
Perform the compilation semantics of nt.
translate: ( xt-int xt-comp xt-post "name" -- )
(formerly rectype:)
Define "name"
"name" execution: ( i*x -- translation )
"name" execution: ( -- translation-token )
Stack effect to produce a translation: ( i*x -- translation )
"name" interpreting action: ( ... translation -- ... )
Remove the top of stack (the translation token) and execute xt-int.
"name" compiling action: ( ... translation -- ... )
Remove the top of stack (the translation token) and execute xt-comp.
"name" postponing action: ( translation -- )
Remove the top of stack (the translation token) and execute xt-post.
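To illustrate the three actions, here is a sketch of a user-defined translation for a pair of cells. The names my-2lit, and translate-pair are hypothetical; the code assumes a system implementing translate: as specified above.

```forth
\ Sketch only; my-2lit, and translate-pair are hypothetical names.
: my-2lit, ( x1 x2 -- ) postpone 2literal ;  \ compile x1 x2 as a literal

:noname ( x1 x2 -- x1 x2 ) ;        \ int-xt: the pair simply stays on the stack
' my-2lit,                          \ comp-xt: compile the pair
:noname ( x1 x2 -- )                \ post-xt: compile code that compiles the pair
    my-2lit, postpone my-2lit, ;
translate: translate-pair

1 2 translate-pair   ( -- 1 2 translation-token )
drop 2drop           \ discard the translation again
```

The post-xt follows the rule that the postponing run-time compiles the compiling run-time: it first compiles the pair as a literal, then compiles a call to the word that will compile it.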
postpone
Interpretation:
Interpretation semantics for this word are undefined.
Compilation: ( "<spaces>name" -- )
Skip leading space delimiters. Parse name delimited by a space.
Use rec-forth to recognize name, resulting in translation with
translation-token. For a system-defined translation token, first
consume the translation, then compile the 'compiling' run-time. For a
user-defined translation token, remove it from the stack and execute
its post-xt.
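One visible consequence of this definition is that postpone also works on anything rec-forth recognizes, not only on names. A small sketch, assuming a system implementing this proposal with rec-number in rec-forth; lit-42, and answer are hypothetical names:

```forth
\ Sketch only; lit-42, and answer are hypothetical names.
decimal
: lit-42, ( -- ) postpone 42 ;   \ when run, compiles 42 as a literal
: answer ( -- n ) [ lit-42, ] ;  \ lit-42, runs while answer is being compiled
answer .                         \ prints 42
```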
Rationale
Names
The names of terms and the proposed Forth words in this proposal have been arrived at after several lengthy discussions in the committee. Experience tells me that many readers (including from the committee) will take issue with one or the other name, but any suggestion for changing names will be ignored by me. If you want them changed, petition the committee (but I hope they will be as weary of renamings as I am).
In particular, I suggested using "recognized" instead of
"translation", and IIRC also renaming the translate-... words
accordingly, but the committee eventually decided to stay with
translation and translate-....
Face it: The names are good enough. Any renaming, even if it results in a better name, increases the confusion more than it helps: even committee members (culprits in the renaming game themselves) have complained about being confused by the new, possibly better names for concepts and words that have already been present in Matthias Trute's proposal.
If you want to improve the proposal, please read it, play with the words in Gforth, read the reference implementation and the tests when they arrive, and point out any mistake or lack of clarity.
Translation tokens and translate-... words
[r1541]
points out interesting uses of knowledge about translation tokens,
and, conflictingly, potential implementation variations. This
proposal decides against the implementation variations and for the
uses by specifying in the Usage Requirements that a translate-...
word just pushes a translation token, and it always pushes the same
one.
Moreover, this proposal specifies the translation tokens that the
proposed-standard recognizers produce. This is useful in various
contexts where recognizers are not used directly in rec-forth, and
it also makes it possible to write tests for the recognizers.
Discarding a translation
[r1541]
also asks for a way to discard a translation. This need has also come
up in some recognizers implemented in Gforth (e.g., rec-tick), and
Gforth uses (non-standard) words like sp@ and sp! for that.
Standard options would be to wrap the word that pushes a translation
into catch and discard the stacks with a non-zero throw, or to use
depth and fdepth in combination with loops of drop and fdrop;
both ways are cumbersome. My feeling is that many in the committee
and in the wider Forth community do not see the need for
discard-translation yet; this may change in the future.
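The catch-based option mentioned above could look roughly like this. It is a sketch, not a proposed word: recognizes? and (probe) are hypothetical names, and the throw codes 1 and 2 are arbitrary positive values chosen for this example.

```forth
\ Sketch only; (probe) and recognizes? are hypothetical names.
: (probe) ( c-addr u xt-rec -- )   \ never returns normally
    execute                        ( ... translation-token )
    translate-none = if 2 else 1 then throw ;

: recognizes? ( c-addr u xt-rec -- flag )
    ['] (probe) catch      ( x1 x2 x3 code )  \ throw restored all stack depths
    >r drop 2drop r>       ( code )
    1 = ;                  \ true only if the recognizer matched

s" 123" ' rec-number recognizes? .   \ prints -1 (true) in decimal base
```

The non-zero throw is what discards the translation: whatever the recognizer left on the data and floating-point stacks is dropped when the depths are restored. If the recognizer itself throws, the result is false.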
Consumers of translations (Usage level 5)
The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, so such words are only used internally in the text interpreter, eliminating the need to standardize them.
However, to give an idea of how all this works together, here are the words that Gforth provides for that purpose:
interpreting ( ... translation -- ... )
For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.
compiling ( ... translation -- ... )
For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.
postponing ( ... translation -- )
For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.
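With these words, the core of a text interpreter reduces to a short loop. The following is a simplified sketch, assuming the Gforth words interpreting and compiling behave as described above; interpret-line is a hypothetical name, and refilling the input is left out.

```forth
\ Sketch only; interpret-line is a hypothetical name.
: interpret-line ( ... -- ... )
    begin
        parse-name dup        \ next blank-delimited string; u=0 at end of line
    while
        rec-forth             ( ... translation )
        state @ if compiling else interpreting then
    repeat
    2drop ;                   \ drop the empty string
```

An unrecognized string needs no special case here: rec-forth yields translate-none, whose interpreting and compiling run-times perform -13 throw.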
Typical use:
s" 123" rec-forth ( translation ) interpreting ( n ) \ leaves 123 on the stack
: umin ( u1 u2 -- u )
    2dup u< if drop else nip then ;
: string-prefix? ( c-addr1 u1 c-addr2 u2 -- f )
    tuck 2>r umin 2r> compare 0= ;
: rec-tick ( addr u -- translation )
    2dup "`" string-prefix? if
        1 /string find-name dup if
            name>interpret translate-cell
        else
            drop translate-none
        then
        exit
    then
    \ this recognizer did not recognize anything, therefore:
    rec-none ;
' noop ( x -- x ) \ int-xt
' lit, ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ; run-time: -- x ) \ post-xt
translate: translate-cell
EkkehardSkirl
Remarks on: Solution
List of five usage level - Level 2:
We should consistently use recs-xxx to name recognizer sequences and rec-xxx to name recognizers. So 'rec-forth' should be 'recs-forth'. (Naming)
List of five usage level - Level 3:
At first glance I was confused about the suddenly appearing 'translation token', since the preamble talked only about 'translation'. It would be clearer to extend the sentence 'The result of the matching recognizer is a translation, an on-stack representation of the word or literal.' to something like 'The result of the matching recognizer is a translation, an on-stack representation of the word or literal, consisting of a translation token and possibly additional data.'
The currently proposed names of the translation tokens suggest an action to me, but they denote a kind of type. So they should be called, more readably, 'translation-XXX', and since we use the short 'rec' for 'recognizer', it would be consistent to shorten them too, for instance to 'trl-XXX'. (Naming)
Remarks on: Proposal:
Usage requirements:
Translations and text-interpretation
The second paragraph sounds a little confusing to me.
'All the proposed-standard translate-... words only push a translation token, and that stack effect is given, but in addition the definitions of these words also show a "Stack effect to produce a translation"; ...'
Maybe this confusion is due to my limited knowledge of idiomatic English.
But why should a word push a 'stack effect', or information about existing stack effects, and for what purpose? These translation tokens are identifiers that tell which recognizer accepted the given string, and I think they are used for comparison. And of course the documentation of the recognizer that produces such a token must document these stack effects?
This text might be written as proposed here:
'All the proposed-standard translate-... words only push a translation token. In addition the definitions of these words shall also contain a "Stack effect to produce a translation" (for instance as a comment in its colon definition); ... '
AntonErtl
Tests
I did not want to repost everything just to post the current status of the tests, so here they are just as a reply:
You can find tests here. As they currently are, they work on Gforth, but there seems to be a bug in either the tests or in Gforth that appears when you uncomment one or both of the commented-out tests (search for \ t{).