Proposal: Recognizer committee proposal 2025-09-11

Informal

AntonErtl [412] Recognizer committee proposal 2025-09-11 (Proposal, 2025-09-12 03:56:07)

The committee has found consensus on the following words. I was asked to write it up as a proposal, quickly. Due to time limits this is just a skeleton, and will not make sense to people new to the discussion. A more fleshed-out proposal will be submitted later.

Translations and text-interpretation

Recognizers produce translations. The text interpreter (and other users, such as postpone) removes the translation from the stack(s) and then performs the interpreting run-time, the compiling run-time, or the postponing run-time.

Unless otherwise specified, the compiling run-time compiles the interpreting run-time. The postponing run-time compiles the compiling run-time.

Types

translation: The result of a recognizer; the input of interpreting, compiling, and postponing; it's a semi-opaque type that consists of a translation token at the top of the data stack and additional data on various stacks below.

translation token: Single-cell item that identifies a certain translation. (This has formerly been called a rectype.)

Words

rec-name ( c-addr u -- translation )

If c-addr u is the name of a visible local or a visible named word, translation represents the text-interpretation semantics (interpreting, compiling, postponing) of that word (see translate-name). (formerly called rec-nt). If not, translation is translate-none.

rec-number ( c-addr u -- translation )

If c-addr u is a single or double number (without or with prefix), or a character, all as described in section ..., translation represents pushing that number at run-time (see translate-cell, translate-dcell). If not, translation is translate-none.

rec-float ( c-addr u -- translation )

If c-addr u is a floating-point number, as described in section ..., translation represents pushing that number at run-time (see translate-float). If c-addr u has the syntax of a double number without prefix according to section ..., and it corresponds to the floating-point number r corresponding to that string according to section ..., translation may represent pushing r at run-time. If c-addr u is not recognized as a floating-point number, translation is translate-none.

rec-none ( c-addr u -- translation )

This word does not recognize anything. For its translation, see translate-none. (formerly known as notfound and r:fail)

recs ( -- )

Print the recognizers in the recognizer sequence in rec-forth, the first searched recognizer leftmost. (formerly known as .recognizers)

rec-forth ( c-addr u -- translation )

This is a deferred word that contains the recognizer (sequence) that is used by the Forth text interpreter. (formerly forth-recognize)

rec-sequence: ( xtu .. xt1 u "name" -- )

Define a recognizer sequence "name" containing u recognizers represented by their xts. If set-recs is implemented, the sequence must be able to accommodate at least 16 recognizers.

name execution: ( c-addr u -- translation )

Execute xt1; if the resulting translation is the result of translate-none, restore the data stack to ( c-addr u -- ) and try the next xt. If there is no next xt, remove ( c-addr u -- ) and perform translate-none.
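
The search order described above can be modelled in standard Forth without any of the proposed words. This is only a toy sketch under stated assumptions: none-token, cell-token, toy-rec-none, toy-rec-seven, toy-seq, cur-string and run-seq are all hypothetical names, the "none" translation is assumed to occupy a single cell, and the sequence is stored as a counted table rather than created by rec-sequence:.

```forth
\ Toy model only; none of these names are part of the proposal.
0 constant none-token      \ assumed stand-in for the token of translate-none
1 constant cell-token      \ assumed stand-in for the token of translate-cell

: toy-rec-none ( c-addr u -- translation )  2drop none-token ;

\ Recognizes exactly the one-character string "7".
: toy-rec-seven ( c-addr u -- translation )
    s" 7" compare 0= if  7 cell-token  else  none-token  then ;

\ A sequence stored as: count, xt1 (searched first), xt2, ...
create toy-seq  2 ,  ' toy-rec-none ,  ' toy-rec-seven ,

2variable cur-string       \ saved c-addr u (makes this model non-reentrant)

: run-seq ( c-addr u seq -- translation )
    rot rot cur-string 2!                ( seq )
    dup cell+  swap @ cells  over +      ( start end )
    swap ?do                             \ i walks over the stored xts
        cur-string 2@  i @ execute       ( translation )
        dup none-token <> if  unloop exit  then
        drop                             \ discard the one-cell "none" translation
    1 cells +loop
    none-token ;                         \ no recognizer matched

s" 7" toy-seq run-seq   \ leaves 7 cell-token: toy-rec-none failed, toy-rec-seven matched
2drop
```

Saving the string in a variable sidesteps the question of how a real system restores ( c-addr u ) underneath a translation of unknown size.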

translate-none ( -- translation )

(formerly r:fail or notfound)

translation interpreting run-time: ( ... -- )

-13 throw

translation compiling run-time: ( ... -- )

-13 throw

translation postponing run-time: ( ... -- )

-13 throw

translate-cell ( x -- translation )

(formerly translate-num)

translation interpreting run-time: ( -- x )

translate-dcell ( xd -- translation )

(formerly translate-dnum)

translation interpreting run-time: ( -- xd )

translate-float ( r -- translation )

translation interpreting run-time: ( -- r )

translate-name ( nt -- translation )

(formerly translate-nt)

translation interpreting run-time: ( ... -- ... )

Perform the interpretation semantics of nt.

translation compiling run-time: ( ... -- ... )

Perform the compilation semantics of nt.

translate: ( xt-int xt-comp xt-post "name" -- )

Define "name" (formerly rectype:)

"name" execution: ( i*x -- translation )

translation interpreting run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-int.

translation compiling run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-comp.

translation postponing run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-post.
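
One way to picture these three run-times is as a table of three xts whose address serves as the translation token. The following is a toy model in standard Forth, not the proposed translate: itself and not Gforth's implementation: toy-translate:, toy-interpreting, toy-compiling, toy-postponing, int-cell, comp-cell, post-cell and toy-translate-cell are hypothetical names chosen for illustration.

```forth
\ Toy model only; every name defined here is hypothetical.
: toy-translate: ( xt-int xt-comp xt-post "name" -- )
    create  rot ,  swap ,  , ;           \ table layout: int, comp, post
\ "name" execution: ( i*x -- i*x token ), where token is the table address.

\ Each consumer removes the token and executes the matching table entry.
: toy-interpreting ( ... token -- ... )  @ execute ;
: toy-compiling    ( ... token -- ... )  cell+ @ execute ;
: toy-postponing   ( ... token -- ... )  2 cells + @ execute ;

\ Actions for a single-cell value x.
: int-cell  ( x -- x ) ;                  \ interpreting: leave x on the stack
: comp-cell ( x -- )  postpone literal ;  \ compiling: compile x as a literal
: post-cell ( x -- )                      \ postponing: compile the compiling run-time
    postpone literal  ['] comp-cell compile, ;

' int-cell  ' comp-cell  ' post-cell  toy-translate: toy-translate-cell

123 toy-translate-cell toy-interpreting           \ leaves 123
drop
: five  [ 5 toy-translate-cell toy-compiling ] ;  \ five pushes 5 when run
```

In this model the token is constant for a given defined name, and the data below it (here x) is what the chosen action consumes.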

get-recs ( xt -- xt_u ... xt_1 u )

xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.

set-recs ( xt_u ... xt_1 u xt -- )

xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched first, and xt_u is searched last. Throw ... if u exceeds the number of elements supported by the recognizer sequence.

Rationale

(This will also be fleshed out)

The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, and such words are only used internally in the text interpreter.

However, to give an idea of how all this works together, here are the words that Gforth provides for that purpose:

interpreting ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.

compiling ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.

postponing ( ... translation -- )

For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.

Examples

s" 123" rec-forth ( translation ) interpreting

: rec-tick ( addr u -- translation )
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret ( xt ) translate-num then
        exit then
    \ 2drop notfound
    rec-none ;

' noop                       ( x -- x )                             \ int-xt
' lit,                       ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ;  run-time: -- x ) \ post-xt
translate: translate-num

AntonErtl New Version: Recognizer committee proposal 2025-09-11

The committee has found consensus on the following words. I was asked to write it up as a proposal, quickly. Due to time limits this is just a skeleton, and will not make sense to people new to the discussion. A more fleshed-out proposal will be submitted later.

Translations and text-interpretation

Recognizers produce translations. The text interpreter (and other users, such as postpone) removes the translation from the stack(s) and then performs the interpreting run-time, the compiling run-time, or the postponing run-time.

Unless otherwise specified, the compiling run-time compiles the interpreting run-time. The postponing run-time compiles the compiling run-time.

Types

translation: The result of a recognizer; the input of interpreting, compiling, and postponing; it's a semi-opaque type that consists of a translation token at the top of the data stack and additional data on various stacks below.

translation token: Single-cell item that identifies a certain translation. (This has formerly been called a rectype.)

Words

rec-name ( c-addr u -- translation )

If c-addr u is the name of a visible local or a visible named word, translation represents the text-interpretation semantics (interpreting, compiling, postponing) of that word (see translate-name). If not, translation is translate-none. (formerly called rec-nt)

rec-number ( c-addr u -- translation )

If c-addr u is a single or double number (without or with prefix), or a character, all as described in section ..., translation represents pushing that number at run-time (see translate-cell, translate-dcell). If not, translation is translate-none.

rec-float ( c-addr u -- translation )

If c-addr u is a floating-point number, as described in section ..., translation represents pushing that number at run-time (see translate-float). If c-addr u has the syntax of a double number without prefix according to section ..., and it corresponds to the floating-point number r corresponding to that string according to section ..., translation may represent pushing r at run-time. If c-addr u is not recognized as a floating-point number, translation is translate-none.

rec-none ( c-addr u -- translation )

This word does not recognize anything. For its translation, see translate-none.

recs ( -- )

Print the recognizers in the recognizer sequence in rec-forth, the first searched recognizer leftmost. (formerly known as .recognizers)

rec-forth ( c-addr u -- translation )

This is a deferred word that contains the recognizer (sequence) that is used by the Forth text interpreter. (formerly forth-recognize)

rec-sequence: ( xtu .. xt1 u "name" -- )

Define a recognizer sequence "name" containing u recognizers represented by their xts. If set-recs is implemented, the sequence must be able to accommodate at least 16 recognizers.

name execution: ( c-addr u -- translation )

Execute xt1; if the resulting translation is the result of translate-none, restore the data stack to ( c-addr u -- ) and try the next xt. If there is no next xt, remove ( c-addr u -- ) and perform translate-none.

translate-none ( -- translation )

(formerly r:fail or notfound)

translation interpreting run-time: ( ... -- )

-13 throw

translation compiling run-time: ( ... -- )

-13 throw

translation postponing run-time: ( ... -- )

-13 throw

translate-cell ( x -- translation )

(formerly translate-num)

translation interpreting run-time: ( -- x )

translate-dcell ( xd -- translation )

(formerly translate-dnum)

translation interpreting run-time: ( -- xd )

translate-float ( r -- translation )

translation interpreting run-time: ( -- r )

translate-name ( nt -- translation )

(formerly translate-nt)

translation interpreting run-time: ( ... -- ... )

Perform the interpretation semantics of nt.

translation compiling run-time: ( ... -- ... )

Perform the compilation semantics of nt.

translate: ( xt-int xt-comp xt-post "name" -- )

Define "name" (formerly rectype:)

"name" execution: ( i*x -- translation )

translation interpreting run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-int.

translation compiling run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-comp.

translation postponing run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-post.

get-recs ( xt -- xt_u ... xt_1 u )

xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.

set-recs ( xt_u ... xt_1 u xt -- )

xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched first, and xt_u is searched last. Throw ... if u exceeds the number of elements supported by the recognizer sequence.

Rationale

(This will also be fleshed out)

The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, and such words are only used internally in the text interpreter.

However, to give an idea of how all this works together, here are the words that Gforth provides for that purpose:

interpreting ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.

compiling ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.

postponing ( ... translation -- )

For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.

Examples

s" 123" rec-forth ( translation ) interpreting

: rec-tick ( addr u -- translation )
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret ( xt ) translate-num then
        exit then
    \ 2drop notfound
    rec-none ;

' noop                       ( x -- x )                             \ int-xt
' lit,                       ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ;  run-time: -- x ) \ post-xt
translate: translate-num

EricBlake

I'm aware you intend to flesh this out further, but when doing so, the following observations may be useful.

Examples

 s" 123" rec-forth ( translation ) interpreting

It might be helpful to also show the stack effect after interpreting, as in:

s" 123" rec-forth ( translation ) interpreting ( n ) \ leaves 123 on the stack

If I understand the intent correctly, the difference between rec-none and translate-none is that both produce the same translation (which consists of a translation token and possibly additional cells), while only the former also consumes addr u before pushing that translation. Taking it further, since translation is a semi-opaque type of one or more cells, I think I can still implement it where the translation token returned by translate-none is a literal 0 (trivially as 0 constant translate-none), provided that my interpreting/compiling/postponing all recognize a literal 0 as the built-in translation whose effects are to result in -13 throw. Or I could implement :noname -13 throw ; dup dup translate: translate-none where the translation token is a non-zero xt just like any other user-defined word created by translate:, and then interpreting/compiling/postponing don't have to special-case 0. But either way, my implementation choice for translate-none is not unduly constrained by the standard, and not relevant or visible to the user; but it DOES place constraints on the user writing their own recognizers to use rec-none or translate-none instead of hard-coding 0 in their code, if they don't want an environmental dependency on the implementation. I think I like how that turned out.

But given that analysis, it means that the standard should not prohibit a translation token from being a literal 0; when compared to the other recent work in r1533 to designate that xt => x \ flag, the standard should be clear that translation token => x and not translation token => xt.

Continuing with the examples,

: rec-tick ( addr u -- translation )
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret ( xt ) translate-num then

Why is this calling out translate-num instead of translate-cell?

        exit then

This exit leaves the stack with either xt translate-num or 0; the former makes sense but the latter assumes translate-none produces a literal 0, which I just argued above is a specific implementation choice rather than something the standard mandates. Would it be better as:

... find-name ?dup if name>interpret ( xt ) translate-cell exit then translate-none exit ...
    \ 2drop notfound
    rec-none ;

I like how rec-none serves the same role as the former 2drop notfound, but does leaving in the comment aid in understanding the example?

' noop                       ( x -- x )                             \ int-xt
' lit,                       ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ;  run-time: -- x ) \ post-xt
translate: translate-num

Is this part of the example intended to be a potential reference implementation of translate-cell? Or is it intended to supply the translate-num used in the rec-tick above, in which case the order of presentation should be swapped?

ruv

translate-cell ( x -- translation )
translate-name ( nt -- translation )

The naming scheme translate-*** is inappropriate and confusing for these words; for example, the name translate-name implies that the word performs some translation, but this word actually does not perform any translation, it is just a constant (i.e., it simply pushes a single-cell token on the stack; and this should be indicated in its stack diagram).

We should find a better naming scheme for these words.

Possible options:

  • ***-recognized
  • ***-tag or tag-*** (because effectively this value is a data type tag for a data object)
  • td-*** (from "token discriminator" or "token descriptor", similar to tag)

Other?


@EricBlake wrote:

I think I can still implement it where the translation token returned by translate-none is a literal 0

Yes, for example, in Gforth it is currently implemented this way.

A zero value on failure simplifies analysis: you can write «dup if ...» or «if ...» instead of «dup tag-none <> if ...». If most implementations stick to this approach, it can be standardized.

Why is this calling out translate-num instead of translate-cell?

This is probably an oversight after renamings.

EricBlake

it DOES place constraints on the user writing their own recognizers to use rec-none or translate-none instead of hard-coding 0 in their code, if they don't want an environmental dependency on the implementation. I think I like how that turned out.

But now in trying to code something up that uses recognizers from the user's point of view, I tried to write a quick word that accepts single-cell numbers but rejects character strings that would produce a double cell or not be recognized as a number. Using gforth's implementation, it might be as simple as:

: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
  rec-number
  case
    translate-cell of ( n ) true exit endof
    translate-none of ( -- ) 0 exit endof
    translate-dcell of ( d ) 2drop 0 exit endof
    abort" unexpected translation" endcase ;

But re-reading the proposed specification, translate-cell has a mandated stack effect of ( n -- translation ), and my usage above did not satisfy that requirement. So, what if I modify it along these lines, to ensure that every time translate-cell is used, there is an n on the stack before-hand (and then jumping through hoops to get it back off the stack)?

: token-none ( -- token ) \ determine the token produced by translate-none
  translate-none
;
: token-cell ( -- token ) \ determine the token produced by translate-cell
  0 translate-cell dup >r interpreting drop r>
;
: token-dcell ( -- token ) \ determine the token produced by translate-dcell
  #0. translate-dcell dup >r interpreting 2drop r>
;
: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
  rec-number
  case
    token-cell of ( n ) true exit endof
    token-none of ( -- ) 0 exit endof
    token-dcell of ( d ) 2drop 0 exit endof
    abort" unexpected translation" endcase ;

Alas, even that seems like it is not portable to the proposed spec, but has an environmental dependency on gforth's implementation. There's nothing in the proposed wording that requires the translation token cell to be identical regardless of what the rest of the overall translation represents. In fact, I see nothing that requires translate-none to produce a single cell, nor for any other given flavor of translation to occupy a consistent number of cells regardless of the value being translated. The proposal is very clear that 123 translate-cell interpreting will leave 123 on the stack, but intentionally does not state how many cells of the stack are in use in between translate-cell and interpreting (only that it is a semi-opaque type of one or more cells).

Put differently, gforth's implementation for translate-cell happens to be idempotent and produce a translation that occupies two cells (namely, the value of the cell being translated, and a single-cell translation token); but based on just the proposed spec, what would prevent an alternative implementation that has a translation occupy exactly one cell (namely, a pointer to an internal struct that wraps multiple pieces of information, including the value to push, the current line/offset of the source at the time the call to translate-cell was made in order to make for more friendly SEE output, and so on)? With such an implementation, I could argue that 0 translate-cell 0 translate-cell = producing false is acceptable, because it results in two different pointers (there were two different source locations at the time of the two different invocations of translate-cell). Or what would prevent an implementation where 0 translate-cell occupies one cell, because it is frequently encountered and worth special-casing in the interpreter loop, vs. 123 translate-cell occupying two cells, because it is infrequently encountered?

From the user's point of view, it would be a lot more powerful if we had a guarantee that a given translate-XXX produces the same translation token at the top of the stack (even if the rest of the stack is variable-length), and that a given rec-XXX produces idempotent output (maybe with limitations on how SOURCE and >IN can be changed between the recognizer and the action on the translation). It would also be nice if we could guarantee that ALL instances of translate-XXX have the behavior of pushing a single cell to the stack, where that cell is constant for a given translate-XXX, and document that comparing translation tokens is well-defined, and that the rest of the stack diagram for a translator only matters if the resulting translation will be further passed to interpreting/compiling/postponing (i.e., translate-cell drop is always unambiguous and cannot cause stack underflow, but it is ambiguous behavior to attempt translate-cell compiling if there was not an n on the stack).

Finally, it would also be nice if there were an easy way to discard the entire stack effect of a given translation, if the result of a recognizer produces a different translation than desired. Maybe discard ( translation -- ), so that I could rewrite my earlier example more compactly:

: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
  rec-number
  dup translate-cell = if ( n token-cell ) drop true exit then
  ( 0 | d token-dcell -- ) discard 0 ;

ruv

But re-reading the proposed specification, translate-cell has a mandated stack effect of ( n -- translation )

The idea is that translate-cell has stack effect ( -- x.td ), and ( x x.td ) is a translation. The name translate-cell with its stack diagram is very confusing. I think this proposal needed more work before publication.

See also my suggestion for naming these data types.

ruv

it would also be nice if there were an easy way to discard the entire stack effect of a given translation, if the result of a recognizer produces a different translation than desired. Maybe discard ( translation -- )

Yes, it would be useful and convenient. But then we should associate more information with data type identifiers. I think I would implement this.

In this proposal, "translation" (yes, the name is inappropriate) in stack diagrams is a data type symbol for the data type that is a subtype of ( ut td ), which is a pair of ut (unqualified token) and td (token descriptor) at the top, where:

  • td => x, i.e., the token descriptor (a data type) is a subtype of the unspecified cell;
  • ut => ( F: j*r ; S: i*x ; ), i.e., the unqualified token (a data type) is an arbitrary tuple (possibly empty) of r and x values, so that r values reside on the floating point stack, x values reside on the data stack.

In the language of type theories, ( ut td ) is a dependent pair type, since each member of td is associated with some particular subtype of ut. In other words, the value of the stack parameter td determines the type of the stack parameter ut. Therefore, a member of the type ( ut td ) can be interpreted as a tagged data object, in which the value td is a tag for the value ut.

Thus, each member of td is also an identifier of some data type. The Forth system associates with this identifier information about how to translate (interpret or compile) the members of the corresponding data type (a subtype of ut). Of course, this identifier may also be associated with information about how to remove the members of this data type from the stacks.

Since the user can effectively define their own data types, we should provide a way to create a token descriptor (a member of td) and associate various information with it in several steps. The main piece of information is about the translation of the data type members. Information about postponing and discarding (removing from the stacks) may be optional.

Regarding terminology/naming. We can use the term "data type identifier" instead of "token descriptor", but 1) this name is longer, 2) then there will be a number of terms that look very similar: "data type", "data type symbol", "data type identifier". Therefore, I would prefer more distinguishable terms.

As an option, instead of "token descriptor" we can use "type descriptor".


An alternative solution to remove a qualified token from the stack is to determine its size. This can be done by storing the stack depth before the qualified token is placed on the stack and then calculating the change in stack depth. For example, see the word apply-recognizer-filter in my recognizer/filter.fth and its use in the word available-xt in example.text-translator.fth.
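
The depth-based idea can be sketched in standard Forth as follows. This is only an illustration, not the code from recognizer/filter.fth: drop-to-depth, try-and-discard and toy-rec-pair are hypothetical names, the toy recognizer's output is made up, and only the data stack is handled (items on the floating-point stack would need the same treatment with fdepth).

```forth
\ Toy sketch only; every name defined here is hypothetical.

\ Drop data-stack items until DEPTH equals n.
: drop-to-depth ( i*x n -- )
    >r  begin  depth r@ >  while  drop  repeat  r> drop ;

\ A made-up recognizer that leaves three cells, whatever the input.
: toy-rec-pair ( c-addr u -- x1 x2 token )  2drop 11 22 99 ;

\ Run recognizer xt on c-addr u, then discard whatever it left behind.
: try-and-discard ( c-addr u xt -- )
    depth 3 - >r             \ depth0: stack depth below c-addr u xt
    execute                  ( translation ) ( R: depth0 )
    r> drop-to-depth ;       \ drop back down to depth0

s" abc" ' toy-rec-pair try-and-discard   \ stack depth is unchanged afterwards
```

The caller never needs to know how many cells the recognizer produced; it only compares stack depths before and after.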

AntonErtl

The mentions of translate-num in the examples are oversights and should be translate-cell. The way that the example rec-tick deals with the case where the word is not found does not work with a non-zero translate-none. A correct implementation is:

: rec-tick ( addr u -- translation ) \ gforth-experimental
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret translate-cell exit then
        drop translate-none exit then
    rec-none ;

EricBlake

: rec-tick ( addr u -- translation ) \ gforth-experimental
   over c@ '`' = if

The \ gforth-experimental comment can be dropped. Is a recognizer guaranteed that u will be non-zero, or is this c@ at risk of reading beyond the bounds of the input argument? And if the recognizer is called on the length-1 string "`", should this example be relying on the implementation-defined results of c-addr 0 find-name (likely 0, but possibly an nt if the implementation allows for an empty-length dictionary entry)?

ruv

I think we should fix the following problems.

The term "translation"

The term translation is not suitable for denoting the general type of a recognizer's result, since "translation" is either an act of translating or a product of translating (not of recognizing). Even the term "recognition" would be more suitable, if someone likes it.

Another possible option: "recognized", which will be used as a nominalized adjective (i.e., a noun).

We also need a separate term to denote the type of the topmost x value of a successful recognizing result.

The scheme translate-something

The naming scheme translate-something is not suitable for words that have type ( -- x ) and are constants.

  • Effectively, any member of this naming scheme is a verb phrase; this scheme was intended for words that perform translation (interpretation or compilation), which is an active action with possible side effects.
    • For example, translate-nt ( i*x nt -- j*x ).

A word that is a constant should have the name that is a noun or a noun phrase.

This naming scheme should be aligned with the corresponding general data type name/symbol.

The names get-recs and set-recs

The pair of words ( get-recs, set-recs ) is similar to the pair of standard words ( get-order, set-order ) by the form of their names, but very different conceptually, since they accept the object on the top. This is an inconsistency in naming conventions.

Better naming options are:

  • recs@ and recs!
    • "recs" in these names denotes the pair of types at once
      • the type of a data object that is fetched or stored
      • the type of a target data object
    • it is also similar to some extent to the pairs of standard words (defer@, defer!), (c@, c!), (2@, 2!)
      • see also my post on ForthHub in this regard.
  • fetch-recs and store-recs

The names translate: and rec-sequence:

The corresponding words are proposed as defining words.

Traditionally, a colon was only used in the names of standard defining words that have a counterpart word with a semicolon in the name. So, this name is inconsistent with other names. Note that this tradition was broken by the new "*field:" words (but not +field).

  • Can we avoid a colon in the defining words that don't have a counterpart word with a semicolon?

The name rec-sequence: is too close to rec-sequence that is a member of the rec-something naming scheme. This is inconsistent and confusing.

  • A possible option: recs — an abbreviation of "recognizers sequence", which is "sequence of recognizers".
    • Maybe it would be better if this word were like wordlist, which produces a new identifier on the stack without creating a word.

ruv

Erratum:

see also my post on ForthHub in this regard.

The correct link

Josef

I agree with @ruv that "translation" doesn't quite fit and finding suitable terms is a real challenge. I utilized this proposal, BerndPaysan's retired recognizer proposal, FORTH Inc.'s recognizer page, and the comments here and on the mailing list.

Suggestions short summary

Remove the "translation" term because it obfuscates the possible outputs, explained below.

Because a "translation token" is a table of run-time actions, "run-time action table" (rat) would seem appropriate, explained below.

A recognizer definition is proposed below.

It doesn't seem that the connection between a recognizer's pattern and the rest of the steps is really discussed. Matching a text token to the pattern is the first step. The parameters are then fetched according to the recognizer's pattern. The run-time action table is associated with a specific pattern parameter.

Recognizer term

From this proposal and FORTH, Inc.'s write up, the following seems to be how to design a recognizer:

  1. Determine the text pattern of the data.
    • E.g. complex numbers follow an "a+bi" pattern.
  2. Create a pattern matching algorithm for the text pattern.
  3. Determine pattern parameters to be fetched.
  4. Determine the run-time action tables to pair with the fetched pattern parameters.

Recognizer definition proposal: A recognizer attempts to match a text token to a pattern. A successful text token match invokes fetching the pattern parameters and the associated run-time action table. A failed matching attempt outputs a rat-none. The text interpreter (and other users, such as postpone) utilizes the run-time action table to perform either the interpreting run-time, compiling run-time, or postponing run-time.

rec-sequence: → make-rec-sequence ( xtu .. xt1 u "name" -- )

rec-name ( c-addr u -- xt rat | rat-none)

rec-num ( c-addr u -- i*x rat | rat-none)

rec-none ( c-addr u -- rat-none )

I agree with @ruv's suggestions for get-recs and set-recs, i.e. recs@ & recs!.
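
As a rough illustration of the four design steps above, here is a minimal sketch in standard Forth. It does not use any of the proposed words: rat-none and rat-cell are hypothetical constants standing in for run-time action tables, and rec-sharp is an invented recognizer for tokens of the form "#" followed by decimal digits. It assumes BASE is decimal.

```forth
\ Hypothetical stand-ins for run-time action tables (not part of any proposal).
0 constant rat-none
1 constant rat-cell

\ Toy recognizer: matches "#" followed by one or more decimal digits.
: rec-sharp ( c-addr u -- n rat-cell | rat-none )
    dup 2 < if  2drop rat-none exit  then              \ too short to match
    over c@ [char] # <> if  2drop rat-none exit  then  \ steps 1-2: match the pattern
    1 /string  0 0 2swap >number                       \ step 3: fetch the parameter
    nip if  2drop rat-none exit  then                  \ unconverted characters: no match
    drop rat-cell ;                                    \ step 4: pair parameter and rat
```

With these definitions, s" #42" rec-sharp leaves 42 and rat-cell on the stack, while s" 42" rec-sharp leaves only rat-none. Pattern matching and parameter fetching are separate phrases here; in the (FIND) example below they are combined in one word.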

Translation Term

Translation seems to hide information. The relationship between the pattern parameters and the run-time action table is fixed. Because different recognizers produce different outputs, using "translation" as a catchall obscures the output, rather than listing the output i*x rat , xt rat , etc.

translation token → run-time action table: Single-cell item that contains the run-time actions associated with specific pattern parameters, i.e. interpreting run-time, compiling run-time, and postponing run-time. (This has formerly been called a rectype, translation token. It's a table of run-time actions.)

translate: → make-rat ( xt-int xt-comp xt-post "name" -- )

translate-word → rat-word ( -- rat )

pattern parameters: is the optional set of data fetched after a successful text token match. The set is on various stacks below the run-time action table. (This could use a better name, not sure if it's really needed, but it helped my thinking.)

I walk through the examples below with the notes above.

Example: REC-NAME

FORTH, Inc. has this example:

' EXECUTE ' COMPILE, ' POSTPONE, TRANSLATE: TRANSLATE-WORD
' EXECUTE ' EXECUTE  ' COMPILE,  TRANSLATE: TRANSLATE-IMM
: REC-NAME ( c-addr len -- xt addr1 | addr2 )
    (FIND) CASE
        -1 OF  TRANSLATE-WORD  ENDOF
        1 OF  TRANSLATE-IMM  ENDOF
        0 OF  TRANSLATE-NONE  ENDOF
    ENDCASE ;

Compared to the steps above:

  1. Data to be handled is "words in general".
  2. The pattern is a word that is in the dictionary.
  3. The pattern parameters fetched could be:
    • xt 1
    • xt -1
    • c-addr 0
  4. Pattern parameters are associated to rats as follows:
    • -1 to TRANSLATE-WORD
    • 1 to TRANSLATE-IMM
    • 0 to TRANSLATE-NONE (originally, NOTFOUND).

(FIND) completes both Steps 2 & 3. The rat output is based on the pattern parameters fetched, not the pattern being matched.

Example: REC-TICK

From the proposal:

: rec-tick ( addr u -- translation ) \ gforth-experimental
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret translate-cell exit then
        drop translate-none exit then
    rec-none ;

Walking through the steps:

  1. The data to be handled is a ticked word.
  2. The pattern is a name in the dictionary.
  3. The pattern parameters fetched by 1 /string find-name could be:
    • nt
    • 0
  4. Pattern parameters are associated to rats as follows:
    • nt to translate-cell
    • 0 to translate-none

The rat output is based on the pattern parameters fetched, not the pattern being matched. 2drop translate-none seems clearer than rec-none. I keep getting caught looking at the rec-tick example thinking "what is rec-none recognizing?"

Example Observations

  • Pattern matching and pattern parameter fetching can be combined or separate words.
  • It would be reasonable to have a failed pattern parameter fetch be an error. A pitfall of creating recognizers is ensuring there is little to no overlap of patterns.
    • E.g. 'bob is the name as defined, processed by rec-name. 'stan is the ticked version of stan, processed by rec-tick.
  • rec-none could be the final recognizer in recognizer sequences, exiting any further evaluation. Instead of creating a new sequence, one could move rec-none earlier in the sequence.

Thank you for reading this far, hopefully there is more food for thought, than madness.

ruv

@Josef wrote:

I agree with @ruv that "translation" doesn't quite fit and finding suitable terms is a real challenge.

Because a "translation token" is a table of run-time actions, "run-time action table" (rat) would seem appropriate, explained below.

I'm making one more attempt on this matter.

The language of the Standard already uses concepts such as data object, data type, typed data object, and subtyping (see 3.1 Data types).

Using these concepts, we can describe a successful recognition result as a pair consisting of a data object and its corresponding data type.

On the stack, data types must be represented by specific identifiers, similar to how semantics elements are represented by xt identifiers. We might refer to such an identifier as a type descriptor (symbol td).

  • Note: "type descriptor" is preferred over "type identifier" because, in the language of the Standard, we will need expressions like "type descriptor td identifies ...". Using "type identifier" would lead to awkward repetitions such as "type identifier ti identifies ...".
  • Another option for this term could be "type token" (seems less preferable).

Additionally, we might define a qualified data object (symbol qdo) as a pair consisting of a data object and the type descriptor that identifies that object's data type.

  • Note. This concept should be distinguished from the existing concept of a "typed data object".

The elegance and strength of this approach lie in the following points:

  • It builds upon existing terminology, with only slight extensions.
  • It incorporates existing data type symbols into naming conventions.
  • It leverages subtyping relationships between data types to reduce redundancy (adhering to the DRY principle).

Type descriptors can be used to:

  • Translate data objects (into the body of a Forth definition when compiling or side effects when interpreting).
  • Convert data objects to different data types (casting).
    • E.g., getting xt from nt (for example, of an ordinary word only)
  • Check subtyping relationships between data types (or of a qualified data object).
  • Define new type descriptors.

These features can be designed independently of recognizers, and recognizers only rely on them when returning a qualified data object or analyzing a qualified data object from another recognizer.

AntonErtl New Version: Recognizer committee proposal 2025-09-11

Recognizer committee proposal 2025-09-11

The committee has found consensus on the words in this proposal. I was asked to write it up.

Author:

M. Anton Ertl (based on previous work by Matthias Trute, Bernd Paysan, and others, and the input of the standardization committee).

Change Log:

  • 2026-02-08 Fleshed out proposal; worked in feedback up to now.
  • 2025-09-12 [r1535] Some fixes
  • 2025-09-12 [412] Initial version

Problem:

The classical text interpreter is inflexible: E.g., adding floating-point recognizers requires hardcoding the change; several systems include system-specific hooks (sometimes more than one) for plugging in functionality at various places in the text interpreter.

The difficulty of adding to the text interpreter may also have led to missed opportunities: E.g., for string literals the standard did not task the text interpreter with recognizing them, but instead introduced S" and S\" (and their complicated definition with interpretation and compilation semantics).

Solution:

The recognizer proposal is a factorization of the central part of the text interpreter.

As before, the text interpreter parses a white-space-delimited string. Unlike before, the string is now passed to the recognizers in the default recognizer sequence rec-forth, one recognizer after another, until one matches. The result of the matching recognizer is a translation, an on-stack representation of the word or literal. The translation is then processed according to the text-interpreter's state (interpreting, compiling, postponing).

There are five usage levels of recognizers and related recognizer words:

  1. Programs that use the default recognizers. This is the common case, and is essentially like using the traditional hard-coded Forth text interpreter. You do not need to use recognizer words for this level, but you can inform yourself about the recognizers in the current default recognizer sequence with recs. The default recognizer sequence contains at least rec-name and rec-number, and, if the Floating-Point wordset is present, rec-float. Moreover, programmers can now postpone numbers and other recognized things.

  2. Programs that change which of the existing recognizers are used and in what order. The default recognizer sequence is rec-forth. You can get the recognizers in it with get-recs and set them with set-recs. You can also create a recognizer sequence (which is a recognizer itself) with rec-sequence:. This proposal contains pre-defined recognizers rec-name rec-number rec-float rec-none, which can be used with set-recs or for defining a recognizer sequence.

  3. Programs that define new recognizers that use existing translation tokens. New recognizers are usually colon definitions, proposed-standard translation tokens are translate-none translate-cell translate-dcell translate-float translate-name.

  4. Programs that define new translation tokens. New translation tokens are defined with translate:.

  5. Programs that define text interpreters and programming tools that have to deal with recognizers. Words for achieving that are not defined in this proposal, but discussed in the rationale.

See the rationale for more detail and answers to specific questions.

Reference implementation:

TBD.

Testing:

TBD.

Proposal:

Usage requirements:

Data Types

translation: The result of a recognizer; the input of interpreting, compiling, and postponing; it's a semi-opaque type that consists of a translation token at the top of the data stack and additional data on various stacks below.

translation token: (This has formerly been called a rectype.) Single-cell item that identifies a certain translation.

Translations and text-interpretation

A recognizer pushes a translation on the stack(s). The text interpreter (and other users, such as postpone) removes the translation from the stack(s), and then either performs the interpreting run-time, compiling run-time, or postponing run-time.

All the proposed-standard translate-... words only push a translation token. Their stack effects are specified as expecting some data on the stack and pushing a translation. This shows what data is required in addition to the translation token to form a complete translation. A proposed-standard translate-... word pushes the same translation token every time it is invoked.

Compiling and postponing run-time

Unless otherwise specified, the compiling run-time compiles the interpreting run-time. The postponing run-time compiles the compiling run-time.

Exceptions

Add the following exception to table 9.1:

-80 too many recognizers

Words

rec-name ( c-addr u -- translation )

(formerly rec-nt)

If c-addr u is the name of a visible local or a visible named word, translation represents the text-interpretation semantics (interpreting, compiling, postponing) of that word (see translate-name). If not, translation is translate-none.

rec-number ( c-addr u -- translation )

(formerly rec-num)

If c-addr u is a single or double number (without or with prefix), or a character, all as described in section 3.4.1.3 (Text interpreter input number conversion), translation represents pushing that number at run-time (see translate-cell, translate-dcell). If not, translation is translate-none.

rec-float ( c-addr u -- translation )

If c-addr u is a floating-point number, as described in section 12.3.7 (Text interpreter input number conversion), translation represents pushing that number at run-time (see translate-float). If c-addr u has the syntax of a double number without prefix according to section 8.3.1 (Text interpreter input number conversion), and it corresponds to the floating-point number r according to section 12.6.1.0558 (>float), translation may represent pushing r at run-time. If c-addr u is not recognized as a floating-point number, translation is translate-none.

rec-none ( c-addr u -- translation )

This word does not recognize anything. For its translation, see translate-none.

recs ( -- )

(formerly .recognizers)

Print the recognizers in the recognizer sequence in rec-forth, the first searched recognizer leftmost.

rec-forth ( c-addr u -- translation )

(formerly forth-recognize) This is a deferred word that contains the recognizer (sequence) that is used by the Forth text interpreter.

rec-sequence: ( xtu .. xt1 u "name" -- )

Define a recognizer sequence "name" containing u recognizers represented by their xts. If set-recs is implemented, the sequence must be able to accommodate at least 16 recognizers.

name execution: ( c-addr u -- translation )

Execute xt1; if the resulting translation is the result of translate-none, restore the data stack to ( c-addr u -- ) and try the next xt. If there is no next xt, remove ( c-addr u -- ) and perform translate-none.
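
The name execution described above can be modelled in standard Forth roughly as follows. This is only a sketch under simplifying assumptions: my-none and my-cell are hypothetical constants standing in for translation tokens, try-recs takes the table of xts as an explicit argument instead of being produced by a defining word, and the token text is kept in a variable so that it can be pushed again after a failed attempt.

```forth
\ Hypothetical stand-ins for translation tokens (not the proposed words).
0 constant my-none
1 constant my-cell

2variable token-text   \ the string currently being recognized

: try-recs ( c-addr u a-recs n -- i*x token )
    \ a-recs points at n xts; the first-searched recognizer comes first
    2swap token-text 2!
    cells over + swap ?do                 \ loop over the addresses of the xts
        token-text 2@ i @ execute         \ push c-addr u again, try this recognizer
        dup my-none <> if  unloop exit  then
        drop                              \ failed: discard my-none, try the next xt
    1 cells +loop
    my-none ;                             \ no recognizer matched

\ Two toy recognizers for demonstration only.
: rec-never  ( c-addr u -- my-none )    2drop my-none ;
: rec-length ( c-addr u -- u my-cell )  nip my-cell ;

create demo-recs  ' rec-never ,  ' rec-length ,
```

With this table, s" abc" demo-recs 2 try-recs leaves 3 and my-cell, because rec-never fails first and rec-length then matches; with a count of 1 or 0 the result is my-none.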

translate-none ( -- translation )

(formerly r:fail or notfound)

translation interpreting run-time: ( ... -- )

-13 throw

translation compiling run-time: ( ... -- )

-13 throw

translation postponing run-time: ( ... -- )

-13 throw

translate-cell ( x -- translation )

(formerly translate-num)

translation interpreting run-time: ( -- x )

translate-dcell ( xd -- translation )

(formerly translate-dnum)

translation interpreting run-time: ( -- xd )

translate-float ( r -- translation )

translation interpreting run-time: ( -- r )

translate-name ( nt -- translation )

(formerly translate-nt)

translation interpreting run-time: ( ... -- ... )

Perform the interpretation semantics of nt.

translation compiling run-time: ( ... -- ... )

Perform the compilation semantics of nt.

translate: ( xt-int xt-comp xt-post "name" -- )

(formerly rectype:)

Define "name"

"name" execution: ( i*x -- translation )

"name" interpreting action: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-int.

"name" compiling action: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-comp.

"name" postponing action: ( translation -- )

Remove the top of stack (the translation token) and execute xt-post.
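
To make the three actions concrete, here is one conceivable model of such a defining word in standard Forth. It is a sketch, not the reference implementation: all my-... names, do-nothing, compile-cell and postpone-cell are hypothetical, and the translation token is simply the address of a three-cell table. It assumes a system in which the compilation semantics of LITERAL may be performed between [ and ].

```forth
: my-translate: ( xt-int xt-comp xt-post "name" -- )
    create rot , swap , , ;     \ body holds: xt-int, xt-comp, xt-post

\ Consumers: remove the token (top of stack) and execute the chosen xt.
: my-interpreting ( i*x token -- j*x )  @ execute ;
: my-compiling    ( i*x token -- j*x )  cell+ @ execute ;
: my-postponing   ( i*x token -- j*x )  2 cells + @ execute ;

\ Actions for a single-cell literal.
: do-nothing    ( -- ) ;
: compile-cell  ( x -- )  postpone literal ;                 \ compile x as a literal
: postpone-cell ( x -- )  postpone literal  ['] compile-cell compile, ;

' do-nothing ' compile-cell ' postpone-cell my-translate: my-translate-cell

5 my-translate-cell my-interpreting drop             \ interpreting: 5 stays on the stack
: five  [ 5 my-translate-cell my-compiling ] ;       \ compiling: five pushes 5
: gen-five  [ 5 my-translate-cell my-postponing ] ; immediate
: five2  gen-five ;                                  \ gen-five compiles the literal 5
```

Here the postponing action compiles the compiling action, which in turn compiles the interpreting result, mirroring the default relationship between the three run-times stated in the usage requirements.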

get-recs ( xt -- xt_u ... xt_1 u )

xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.

set-recs ( xt_u ... xt_1 u xt -- )

xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched first, and xt_u is searched last. Throw -80 (too many recognizers) if u exceeds the number of elements supported by the recognizer sequence.

postpone

Interpretation:

Interpretation semantics for this word are undefined.

Compilation: ( "<spaces>name" -- )

Skip leading space delimiters. Parse name delimited by a space. Use rec-forth to recognize name, resulting in translation with translation-token. For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.

Rationale

(This will also be fleshed out)

Names

The names of terms and the proposed Forth words in this proposal have been arrived at after several lengthy discussions in the committee. Experience tells me that many readers (including from the committee) will take issue with one or the other name, but any suggestion for changing names will be ignored by me. If you want them changed, petition the committee (but I hope they will be as weary of renamings as I am).

In particular, I suggested to use "recognized" instead of "translation", and IIRC also to rename the translate-... words accordingly, but the committee eventually decided to stay with translation and translate-....

Face it: The names are good enough. Any renaming, even if it results in a better name, increases the confusion more than it helps: even committee members (culprits in the renaming game themselves) have complained about being confused by the new, possibly better names for concepts and words that have already been present in Matthias Trute's proposal.

If you want to improve the proposal, please read it, play with the words in Gforth, read the reference implementation and the tests when they arrive, and point out any mistake or lack of clarity.

Translation tokens and translate-... words

[r1541] points out interesting uses of knowledge about translation tokens, and, conflictingly, potential implementation variations. This proposal decides against the implementation variations and for the uses by specifying in the Usage Requirements that a translate-... word just pushes a translation token, and it always pushes the same one.

Discarding a translation

[r1541] also asks for a way to discard a translation. This need has also come up in some recognizers implemented in Gforth (e.g., rec-tick), and Gforth uses (non-standard) words like sp@ and sp! for that. Standard options would be to wrap the word that pushes a translation into catch and discard the stacks with a non-zero throw, or to use depth and fdepth in combination with loops of drop and fdrop; both ways are cumbersome. My feeling is that many in the committee and in the wider Forth community do not see the need for discard-translation yet; this may change in the future.
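
The catch-based option mentioned above might look like the following sketch in standard Forth. The names run-and-throw, discard-result and fake-rec are hypothetical, and the throw code 1 is an arbitrary program-chosen value; the sketch relies only on THROW restoring the stack depths that were saved by CATCH.

```forth
\ Run a recognizer-like xt, then unconditionally throw, so that CATCH
\ restores the stack depths and thereby discards whatever xt pushed.
: run-and-throw ( c-addr u xt -- )  execute 1 throw ;

: discard-result ( c-addr u xt -- )
    ['] run-and-throw catch      ( x1 x2 x3 n )  \ depths restored, contents undefined
    dup 1 <> if  throw  then     \ pass on any exception raised by xt itself
    drop  drop 2drop ;           \ drop the throw code, then the three restored cells

\ Demonstration with a fake recognizer that pushes four cells.
: fake-rec ( c-addr u -- 1 2 3 4 )  2drop 1 2 3 4 ;
: test-discard ( -- f )
    depth >r  s" abc" ['] fake-rec discard-result  depth r> = ;
```

test-discard returns true because the data stack depth is the same before and after the call. The need for a helper word and a reserved throw code illustrates why this approach is described as cumbersome.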

Consumers of translations (Usage level 5)

The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, so such words are only used internally in the text interpreter, eliminating the need to standardize them.

However, to give an idea of how all this works together, here are the words that Gforth provides for that purpose:

interpreting ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.

compiling ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.

postponing ( ... translation -- )

For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.

Examples

Typical use:

s" 123" rec-forth ( translation ) interpreting ( n ) \ leaves 123 on the stack

: umin ( u1 u2 -- u ) 2dup u< if drop else nip then ;

: string-prefix? ( c-addr1 u1 c-addr2 u2 -- f ) tuck 2>r umin 2r> compare 0= ;

: rec-tick ( addr u -- translation )
    2dup "`" string-prefix? if
        1 /string find-name dup if
            name>interpret translate-cell
        else
            drop translate-none then
        exit then
    \ this recognizer did not recognize anything, therefore:
    rec-none ;

' noop ( x -- x ) \ int-xt
' lit, ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ; run-time: -- x ) \ post-xt
translate: translate-cell

AntonErtl New Version: Recognizer committee proposal 2025-09-11

Recognizer committee proposal 2025-09-11

The committee has found consensus on the words in this proposal. I was asked to write it up.

Author:

M. Anton Ertl (based on previous work by Matthias Trute, Bernd Paysan, and others, and the input of the standardization committee).

Change Log:

  • 2026-02-09 Specify the translation tokens of the rec-... words. Also provide ( -- translation-token ) stack effects for translate-... words.
  • 2026-02-08 [r1614] Fleshed out proposal; worked in feedback up to now.
  • 2025-09-12 [r1535] Some fixes
  • 2025-09-12 [412] Initial version

Problem:

The classical text interpreter is inflexible: E.g., adding floating-point recognizers requires hardcoding the change; several systems include system-specific hooks (sometimes more than one) for plugging in functionality at various places in the text interpreter.

The difficulty of adding to the text interpreter may also have led to missed opportunities: E.g., for string literals the standard did not task the text interpreter with recognizing them, but instead introduced S" and S\" (and their complicated definition with interpretation and compilation semantics).

Solution:

The recognizer proposal is a factorization of the central part of the text interpreter.

As before, the text interpreter parses a white-space-delimited string. Unlike before, the string is now passed to the recognizers in the default recognizer sequence rec-forth, one recognizer after another, until one matches. The result of the matching recognizer is a translation, an on-stack representation of the word or literal. The translation is then processed according to the text-interpreter's state (interpreting, compiling, postponing).

There are five usage levels of recognizers and related recognizer words:

  1. Programs that use the default recognizers. This is the common case, and is essentially like using the traditional hard-coded Forth text interpreter. You do not need to use recognizer words for this level, but you can inform yourself about the recognizers in the current default recognizer sequence with recs. The default recognizer sequence contains at least rec-name and rec-number, and, if the Floating-Point wordset is present, rec-float. Moreover, programmers can now postpone numbers and other recognized things.

  2. Programs that change which of the existing recognizers are used and in what order. The default recognizer sequence is rec-forth. You can get the recognizers in it with get-recs and set them with set-recs. You can also create a recognizer sequence (which is a recognizer itself) with rec-sequence:. This proposal contains pre-defined recognizers rec-name rec-number rec-float rec-none, which can be used with set-recs or for defining a recognizer sequence.

  3. Programs that define new recognizers that use existing translation tokens. New recognizers are usually colon definitions, proposed-standard translation tokens are translate-none translate-cell translate-dcell translate-float translate-name.

  4. Programs that define new translation tokens. New translation tokens are defined with translate:.

  5. Programs that define text interpreters and programming tools that have to deal with recognizers. Words for achieving that are not defined in this proposal, but discussed in the rationale.

See the rationale for more detail and answers to specific questions.

Reference implementation:

TBD.

Testing:

TBD.

Proposal:

Usage requirements:

Data Types

translation: The result of a recognizer; the input of interpreting, compiling, and postponing; it's a semi-opaque type that consists of a translation token at the top of the data stack and additional data on various stacks.

translation token: (This has formerly been called a rectype.) Single-cell item that identifies a certain kind of translation.

Translations and text-interpretation

A recognizer pushes a translation on the stack(s). The text interpreter (and other users, such as postpone) removes the translation from the stack(s), and then either performs the interpreting run-time, compiling run-time, or postponing run-time.

All the proposed-standard translate-... words only push a translation token, and that stack effect is given, but in addition the definitions of these words also show a "Stack effect to produce a translation"; this stack effect points out which additional stack items need to be pushed before the translation token in order to produce a translation.

A proposed-standard translate-... word pushes the same translation token every time it is invoked.

Compiling and postponing run-time

Unless otherwise specified, the compiling run-time compiles the interpreting run-time. The postponing run-time compiles the compiling run-time.

Exceptions

Add the following exception to table 9.1:

-80 too many recognizers

Words

rec-name ( c-addr u -- translation )

(formerly rec-nt)

If c-addr u is the name of a visible local or a visible named word, translation represents the text-interpretation semantics (interpreting, compiling, postponing) of that word, and has the translation token translate-name. If not, translation is translate-none.

rec-number ( c-addr u -- translation )

(formerly rec-num) If c-addr u is a single-cell or double-cell number (without or with prefix), or a character, all as described in section 3.4.1.3 (Text interpreter input number conversion), translation represents pushing that number at run-time. If a single-cell number is recognized, the translation token of translation is translate-cell, for a double cell translate-dcell. If neither is recognized, translation is translate-none.

rec-float ( c-addr u -- translation )

If c-addr u is a floating-point number, as described in section 12.3.7 (Text interpreter input number conversion), rec-float recognizes it as floating-point number r. If c-addr u has the syntax of a double number without prefix according to section 8.3.1 (Text interpreter input number conversion), and it corresponds to the floating-point number r according to section 12.6.1.0558 (>float), rec-float may (but is not required to) recognize it as a floating-point number. If rec-float recognized c-addr u as floating-point number, translation represents pushing that number at run-time, and the translation token is translate-float. If c-addr u is not recognized as a floating-point number, translation is translate-none.

rec-none ( c-addr u -- translation )

This word does not recognize anything. Its translation and translation token is translate-none.

recs ( -- )

(formerly .recognizers) Print the recognizers in the recognizer sequence in rec-forth, the first searched recognizer leftmost.

rec-forth ( c-addr u -- translation )

(formerly forth-recognize) This is a deferred word that contains the recognizer (sequence) that is used by the Forth text interpreter.

rec-sequence: ( xtu .. xt1 u "name" -- )

Define a recognizer sequence "name" containing u recognizers represented by their xts. If set-recs is implemented, the sequence must be able to accommodate at least 16 recognizers.

name execution: ( c-addr u -- translation )

Execute xt1; if the resulting translation is the result of translate-none, restore the data stack to ( c-addr u -- ) and try the next xt. If there is no next xt, remove ( c-addr u -- ) and perform translate-none.
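A small sketch of how such a sequence might be defined and used. It assumes a system that implements the words of this proposal; rec-demo is a hypothetical name, and the tokens compared against are the ones this proposal associates with rec-name and rec-number.

```forth
\ Sketch only: assumes a system implementing this proposal.
\ rec-demo is a hypothetical name.
decimal
' rec-number ' rec-name 2 rec-sequence: rec-demo  \ rec-name is xt1, tried first

s" dup" rec-demo  ( nt token )   translate-name = .  drop
s" 123" rec-demo  ( 123 token )  translate-cell = .  .
s" no-such-name-xyzzy" rec-demo  translate-none = .
```

The second line shows the fall-through: rec-name does not recognize the string, the data stack is restored to c-addr u, and rec-number is tried next.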

get-recs ( xt -- xt_u ... xt_1 u )

xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.

set-recs ( xt_u ... xt_1 u xt -- )

xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched first, and xt_u is searched last. Throw -80 (too many recognizers) if u exceeds the number of elements supported by the recognizer sequence.
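As an illustration of get-recs and set-recs working together, here is a sketch of a helper that makes a recognizer the first one searched in a sequence. It assumes a system that implements this proposal; prepend-rec and rec-demo2 are hypothetical names.

```forth
\ Sketch only: assumes a system implementing this proposal.
\ prepend-rec and rec-demo2 are hypothetical names.
: prepend-rec ( xt-rec xt-seq -- )
    2>r  r@ get-recs    ( xt_u ... xt_1 u )      ( R: xt-rec xt-seq )
    2r> >r              ( xt_u ... xt_1 u xt-rec ) ( R: xt-seq )
    swap 1+             ( xt_u ... xt_1 xt-rec u+1 )
    r> set-recs ;       \ xt-rec is now searched first

' rec-number 1 rec-sequence: rec-demo2
' rec-name ' rec-demo2 prepend-rec   \ rec-demo2 now tries rec-name, then rec-number
```

For the sequence used by the text interpreter, the xt could be obtained with action-of rec-forth, since rec-forth is a deferred word.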

translate-none ( -- translation-token )

(formerly r:fail or notfound)

Stack effect to produce a translation: ( -- translation )

translation interpreting run-time: ( ... -- )

-13 throw

translation compiling run-time: ( ... -- )

-13 throw

translation postponing run-time: ( ... -- )

-13 throw

translate-cell ( -- translation-token )

(formerly translate-num)

Stack effect to produce a translation: ( x -- translation )

translation interpreting run-time: ( -- x )

translate-dcell ( -- translation-token )

(formerly translate-dnum)

Stack effect to produce a translation: ( xd -- translation )

translation interpreting run-time: ( -- xd )

translate-float ( -- translation-token )

Stack effect to produce a translation: ( r -- translation )

translation interpreting run-time: ( -- r )

translate-name ( -- translation-token )

(formerly translate-nt)

Stack effect to produce a translation: ( nt -- translation )

translation interpreting run-time: ( ... -- ... )

Perform the interpretation semantics of nt.

translation compiling run-time: ( ... -- ... )

Perform the compilation semantics of nt.

translate: ( xt-int xt-comp xt-post "name" -- )

(formerly rectype:)

Define "name"

"name" execution: ( -- translation-token )

Stack effect to produce a translation: ( i*x -- translation )

"name" interpreting action: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-int.

"name" compiling action: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-comp.

"name" postponing action: ( translation -- )

Remove the top of stack (the translation token) and execute xt-post.

postpone

Interpretation:

Interpretation semantics for this word are undefined.

Compilation: ( "<spaces>name" -- )

Skip leading space delimiters. Parse name delimited by a space. Use rec-forth to recognize name, resulting in translation with translation-token. For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.
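One consequence of this definition is that postpone also works on anything else that rec-forth recognizes, such as numbers. A minimal sketch, assuming a system that implements this proposal (lit-42 and answer are hypothetical names):

```forth
\ Sketch only: assumes postpone is implemented as described in this proposal.
decimal
: lit-42 ( compilation: -- ; run-time: -- 42 )
    postpone 42 ; immediate   \ when lit-42 runs, it compiles the literal 42

: answer ( -- 42 )
    lit-42 ;                  \ lit-42 executes here, at compile time
```

Here postpone compiles the compiling run-time of the translation for 42, so executing lit-42 inside answer compiles code that pushes 42 when answer runs.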

Rationale

Names

The names of terms and the proposed Forth words in this proposal have been arrived at after several lengthy discussions in the committee. Experience tells me that many readers (including from the committee) will take issue with one or the other name, but any suggestion for changing names will be ignored by me. If you want them changed, petition the committee (but I hope they will be as weary of renamings as I am).

In particular, I suggested to use "recognized" instead of "translation", and IIRC also to rename the translate-... words accordingly, but the committee eventually decided to stay with translation and translate-....

Face it: The names are good enough. Any renaming, even if it results in a better name, increases the confusion more than it helps: even committee members (culprits in the renaming game themselves) have complained about being confused by the new, possibly better names for concepts and words that have already been present in Matthias Trute's proposal.

If you want to improve the proposal, please read it, play with the words in Gforth, read the reference implementation and the tests when they arrive, and point out any mistake or lack of clarity.

Translation tokens and translate-... words

[r1541] points out interesting uses of knowledge about translation tokens, and, conflictingly, potential implementation variations. This proposal decides against the implementation variations and for the uses by specifying in the Usage Requirements that a translate-... word just pushes a translation token, and it always pushes the same one.

Moreover, this proposal specifies the translation tokens that the proposed-standard recognizers produce. This is useful in various contexts where recognizers are not used directly in rec-forth, and it also makes it possible to write tests for the recognizers.

Discarding a translation

[r1541] also asks for a way to discard a translation. This need has also come up in some recognizers implemented in Gforth (e.g., rec-tick), and Gforth uses (non-standard) words like sp@ and sp! for that. Standard options would be to wrap the word that pushes a translation into catch and discard the stacks with a non-zero throw, or to use depth and fdepth in combination with loops of drop and fdrop; both ways are cumbersome. My feeling is that many in the committee and in the wider Forth community do not see the need for discard-translation yet; this may change in the future.
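To illustrate the catch-based option, here is a sketch that asks whether a recognizer recognizes a string and discards the translation by means of a non-zero throw. It assumes a system that implements this proposal; probe-rec and recognizes? are hypothetical names, and the throw codes 1 and 2 are arbitrary positive values.

```forth
\ Sketch only: assumes a system implementing this proposal.
\ probe-rec and recognizes? are hypothetical names.
: probe-rec ( c-addr u xt -- )      \ never returns normally
    execute                         ( translation )
    translate-none = if 1 else 2 then
    throw ;                         \ non-zero throw discards the translation

: recognizes? ( c-addr u xt -- flag )
    ['] probe-rec catch             ( x1 x2 x3 n ) \ x1..x3 have undefined contents
    nip nip nip  2 = ;              \ any other throw code also yields false
```

The throw restores the stack depths saved by catch, so whatever the translation occupied is gone afterwards; the price is the extra wrapper definition, which shows why this route is described as cumbersome.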

Consumers of translations (Usage level 5)

The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, so such words are only used internally in the text interpreter, eliminating the need to standardize them.

However, to give an idea of how all this works together, here are the words that Gforth provides for that purpose:

interpreting ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.

compiling ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.

postponing ( ... translation -- )

For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.

Typical use:

s" 123" rec-forth ( translation ) interpreting ( n ) \ leaves 123 on the stack
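Along the same lines, the central step of a text interpreter could be sketched like this. It assumes the Gforth words interpreting and compiling described above, which are not proposed for standardization; interpret-name is a hypothetical name.

```forth
\ Sketch only: relies on Gforth's non-standard interpreting and compiling.
\ interpret-name is a hypothetical name.
: interpret-name ( ... c-addr u -- ... )
    rec-forth                 ( ... translation )
    state @ if compiling else interpreting then ;

decimal
s" 41" interpret-name  s" 1+" interpret-name  .
```

In interpretation state, the last line interprets the number and then the word 1+, and prints the result.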

: umin ( u1 u2 -- u )
  2dup u< if drop else nip then ;

: string-prefix? ( c-addr1 u1 c-addr2 u2 -- f )
    tuck 2>r umin 2r> compare 0= ;

: rec-tick ( addr u -- translation )
    2dup "`" string-prefix? if
        1 /string find-name dup if
            name>interpret translate-cell
        else
            drop translate-none then
        exit then
    \ this recognizer did not recognize anything, therefore:
    rec-none ;

' noop                       ( x -- x )                             \ int-xt
' lit,                       ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ;  run-time: -- x ) \ post-xt
translate: translate-cell

EkkehardSkirl

Remarks on: Solution

  • List of five usage levels - Level 2:

    • We should consistently use recs-xxx to name recognizer sequences and rec-xxx to name recognizers. So 'rec-forth' should be 'recs-forth' (Naming)
  • List of five usage levels - Level 3:

    • At first glance I was confused about the suddenly appearing 'translation token', since the preamble talked only about 'translation'. So it could be clearer to extend the sentence 'The result of the matching recognizer is a translation, an on-stack representation of the word or literal.' to something like 'The result of the matching recognizer is a translation, an on-stack representation of the word or literal, consisting of a translation token and possibly additional data.'

    • The currently proposed names of the translation tokens suggest an action to me, but they denote a kind of type. So they should be called, more readably, 'translation-XXX', and since we use the short 'rec' for 'recognizer', it would be consistent to give them a shorter name, for instance 'trl-XXX' (Naming)

Remarks on: Proposal:

Usage requirements:

Translations and text-interpretation

The second paragraph sounds a little bit confusing to me.

'All the proposed-standard translate-... words only push a translation token, and that stack effect is given, but in addition the definitions of these words also show a "Stack effect to produce a translation"; ...'

Maybe this confusion is due to my limited knowledge of everyday English.

But why should a word push a 'stack effect' or information about existing stack effects, and for what purpose? These translation tokens are identifiers that tell which recognizer accepted the given string token, and I think they are used for comparison. And of course, the documentation of the recognizer that produces a token must document these stack effects?

This text could be written as proposed here:

'All the proposed-standard translate-... words only push a translation token. In addition the definitions of these words shall also contain a "Stack effect to produce a translation" (for instance as a comment in its colon definition); ... '

AntonErtl

Tests

I did not want to repost everything just to post the current status of the tests, so here they are just as a reply:

You can find tests here. As they currently are, they work on Gforth, but there seems to be a bug in either the tests or in Gforth that appears when you uncomment one or both of the commented-out tests (search for \ t{).
