Proposal: Recognizer committee proposal 2025-09-11

Informal

This page is dedicated to discussing this specific proposal


AntonErtl [412] Recognizer committee proposal 2025-09-11 Proposal 2025-09-12 03:56:07

The committee has found consensus on the following words. I was asked to write it up as a proposal, quickly. Due to time limits this is just a skeleton, and will not make sense to people new to the discussion. A more fleshed-out proposal will be submitted later.

Translations and text-interpretation

Recognizers produce translations. The text interpreter (and other users, such as postpone) removes the translation from the stack(s) and then performs one of the interpreting run-time, the compiling run-time, or the postponing run-time.

Unless otherwise specified, the compiling run-time compiles the interpreting run-time. The postponing run-time compiles the compiling run-time.

Types

translation: The result of a recognizer; the input of interpreting, compiling, and postponing; it's a semi-opaque type that consists of a translation token at the top of the data stack and additional data on various stacks below.

translation token: Single-cell item that identifies a certain translation. (This has formerly been called a rectype.)

Words

rec-name ( c-addr u -- translation )

If c-addr u is the name of a visible local or a visible named word, translation represents the text-interpretation semantics (interpreting, compiling, postponing) of that word (see translate-name). If not, translation is translate-none. (formerly called rec-nt)

rec-number ( c-addr u -- translation )

If c-addr u is a single or double number (without or with prefix), or a character, all as described in section ..., translation represents pushing that number at run-time (see translate-cell, translate-dcell). If not, translation is translate-none.

rec-float ( c-addr u -- translation )

If c-addr u is a floating-point number, as described in section ..., translation represents pushing that number at run-time (see translate-float). If c-addr u has the syntax of a double number without prefix according to section ..., and it corresponds to the floating-point number r according to section ..., translation may represent pushing r at run-time. If c-addr u is not recognized as a floating-point number, translation is translate-none.

rec-none ( c-addr u -- translation )

This word does not recognize anything. For its translation, see translate-none. (formerly known as notfound and r:fail)

recs ( -- )

Print the recognizers in the recognizer sequence in rec-forth, the first searched recognizer leftmost. (formerly known as .recognizers)

rec-forth ( c-addr u -- translation )

This is a deferred word that contains the recognizer (sequence) that is used by the Forth text interpreter. (formerly forth-recognize)

rec-sequence: ( xt_u ... xt_1 u "name" -- )

Define a recognizer sequence "name" containing u recognizers represented by their xts. If set-recs is implemented, the sequence must be able to accommodate at least 16 recognizers.

name execution: ( c-addr u -- translation )

Execute xt_1; if the resulting translation is the result of translate-none, restore the data stack to ( c-addr u -- ) and try the next xt. If there is no next xt, remove ( c-addr u -- ) and perform translate-none.
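
To illustrate the search order just described, here is a minimal model in standard Forth (it needs the Locals and String word sets). It is only a sketch: every model-* word is a hypothetical stand-in rather than one of the proposed words, the sequence is fixed at two recognizers, and it assumes that the translate-none translation is the single cell 0, which the proposal does not guarantee.

```forth
\ Model only: hypothetical stand-ins, not the proposed words.
\ Assumption: the "none" translation is the single cell 0.
0 constant model-none      \ stand-in for the translate-none translation
1 constant model-cell      \ stand-in for the translate-cell token

: model-rec-none ( c-addr u -- translation )
  2drop model-none ;

\ Hypothetical recognizer: accepts only the one-character string "7".
: model-rec-seven ( c-addr u -- translation )
  s" 7" compare 0= if 7 model-cell else model-none then ;

\ Two-element sequence: try xt1 first, then xt2.
: model-seq2 {: c-addr u xt1 xt2 -- translation :}
  c-addr u xt1 execute
  dup model-none <> if exit then   \ xt1 recognized the string
  drop                             \ discard the failed translation
  c-addr u xt2 execute ;           \ retry with the saved c-addr u
```

With these definitions, s" 7" ' model-rec-none ' model-rec-seven model-seq2 leaves 7 model-cell on the stack, and the same call with s" x" leaves model-none. Keeping c-addr u in locals is one way to "restore the data stack" without knowing how many cells a translation occupies.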

translate-none ( -- translation )

(formerly r:fail or notfound)

translation interpreting run-time: ( ... -- )

-13 throw

translation compiling run-time: ( ... -- )

-13 throw

translation postponing run-time: ( ... -- )

-13 throw

translate-cell ( x -- translation )

(formerly translate-num)

translation interpreting run-time: ( -- x )

translate-dcell ( xd -- translation )

(formerly translate-dnum)

translation interpreting run-time: ( -- xd )

translate-float ( r -- translation )

translation interpreting run-time: ( -- r )

translate-name ( nt -- translation )

(formerly translate-nt)

translation interpreting run-time: ( ... -- ... )

Perform the interpretation semantics of nt.

translation compiling run-time: ( ... -- ... )

Perform the compilation semantics of nt.

translate: ( xt-int xt-comp xt-post "name" -- )

Define "name" (formerly rectype:)

"name" execution: ( i*x -- translation )

translation interpreting run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-int.

translation compiling run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-comp.

translation postponing run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-post.

get-recs ( xt -- xt_u ... xt_1 u )

xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.

set-recs ( xt_u ... xt_1 u xt -- )

xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched first, and xt_u is searched last. Throw ... if u exceeds the number of elements supported by the recognizer sequence.
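
As a sketch of how get-recs and set-recs might be combined (by analogy with get-order and set-order), the following puts a recognizer at the front of the sequence used by rec-forth. It assumes a system that implements the words proposed here, that the current action of rec-forth is a recognizer sequence with room for one more element, and that action-of is available; rec-mine is a hypothetical placeholder recognizer.

```forth
\ Hypothetical placeholder recognizer: recognizes nothing.
: rec-mine ( c-addr u -- translation ) rec-none ;

\ Assumption: the current action of rec-forth is a recognizer sequence.
action-of rec-forth get-recs     ( xt_u ... xt_1 u )
' rec-mine swap 1+               ( xt_u ... xt_1 xt-mine u+1 )
action-of rec-forth set-recs     \ rec-mine is now searched first
```

Because the top xt below the count is searched first, pushing the new xt last makes it the first recognizer tried.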

Rationale

(This will also be fleshed out)

The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, and such words are only used internally in the text interpreter.

However, to give an idea of how all this works together, here are the words that Gforth provides for that purpose:

interpreting ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.

compiling ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.

postponing ( ... translation -- )

For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.
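
A text interpreter built on these words could handle one lexeme roughly as follows. This is a sketch, not Gforth's actual code: it assumes a system that provides rec-forth, interpreting and compiling as described above, and interpret-lexeme is a hypothetical name.

```forth
\ Sketch: assumes rec-forth, interpreting and compiling exist as described.
: interpret-lexeme ( i*x c-addr u -- j*x )
  rec-forth                      \ string -> translation
  state @ if compiling           \ compilation state: compiling run-time
  else interpreting then ;       \ interpretation state: interpreting run-time
```

An unrecognized lexeme needs no special case here: rec-forth then returns the translate-none translation, whose run-times perform -13 throw.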

Examples

s" 123" rec-forth ( translation ) interpreting

: rec-tick ( addr u -- translation )
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret ( xt ) translate-num then
        exit then
    \ 2drop notfound
    rec-none ;

' noop                       ( x -- x )                             \ int-xt
' lit,                       ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ;  run-time: -- x ) \ post-xt
translate: translate-num

AntonErtl New Version: Recognizer committee proposal 2025-09-11


The committee has found consensus on the following words. I was asked to write it up as a proposal, quickly. Due to time limits this is just a skeleton, and will not make sense to people new to the discussion. A more fleshed-out proposal will be submitted later.

Translations and text-interpretation

Recognizers produce translations. The text interpreter (and other users, such as postpone) removes the translation from the stack(s) and then performs one of the interpreting run-time, the compiling run-time, or the postponing run-time.

Unless otherwise specified, the compiling run-time compiles the interpreting run-time. The postponing run-time compiles the compiling run-time.

Types

translation: The result of a recognizer; the input of interpreting, compiling, and postponing; it's a semi-opaque type that consists of a translation token at the top of the data stack and additional data on various stacks below.

translation token: Single-cell item that identifies a certain translation. (This has formerly been called a rectype.)

Words

rec-name ( c-addr u -- translation )

If c-addr u is the name of a visible local or a visible named word, translation represents the text-interpretation semantics (interpreting, compiling, postponing) of that word (see translate-name). If not, translation is translate-none. (formerly called rec-nt)

rec-number ( c-addr u -- translation )

If c-addr u is a single or double number (without or with prefix), or a character, all as described in section ..., translation represents pushing that number at run-time (see translate-cell, translate-dcell). If not, translation is translate-none.

rec-float ( c-addr u -- translation )

If c-addr u is a floating-point number, as described in section ..., translation represents pushing that number at run-time (see translate-float). If c-addr u has the syntax of a double number without prefix according to section ..., and it corresponds to the floating-point number r according to section ..., translation may represent pushing r at run-time. If c-addr u is not recognized as a floating-point number, translation is translate-none.

rec-none ( c-addr u -- translation )

This word does not recognize anything. For its translation, see translate-none.

recs ( -- )

Print the recognizers in the recognizer sequence in rec-forth, the first searched recognizer leftmost. (formerly known as .recognizers)

rec-forth ( c-addr u -- translation )

This is a deferred word that contains the recognizer (sequence) that is used by the Forth text interpreter. (formerly forth-recognize)

rec-sequence: ( xt_u ... xt_1 u "name" -- )

Define a recognizer sequence "name" containing u recognizers represented by their xts. If set-recs is implemented, the sequence must be able to accommodate at least 16 recognizers.

name execution: ( c-addr u -- translation )

Execute xt_1; if the resulting translation is the result of translate-none, restore the data stack to ( c-addr u -- ) and try the next xt. If there is no next xt, remove ( c-addr u -- ) and perform translate-none.

translate-none ( -- translation )

(formerly r:fail or notfound)

translation interpreting run-time: ( ... -- )

-13 throw

translation compiling run-time: ( ... -- )

-13 throw

translation postponing run-time: ( ... -- )

-13 throw

translate-cell ( x -- translation )

(formerly translate-num)

translation interpreting run-time: ( -- x )

translate-dcell ( xd -- translation )

(formerly translate-dnum)

translation interpreting run-time: ( -- xd )

translate-float ( r -- translation )

translation interpreting run-time: ( -- r )

translate-name ( nt -- translation )

(formerly translate-nt)

translation interpreting run-time: ( ... -- ... )

Perform the interpretation semantics of nt.

translation compiling run-time: ( ... -- ... )

Perform the compilation semantics of nt.

translate: ( xt-int xt-comp xt-post "name" -- )

Define "name" (formerly rectype:)

"name" execution: ( i*x -- translation )

translation interpreting run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-int.

translation compiling run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-comp.

translation postponing run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-post.

get-recs ( xt -- xt_u ... xt_1 u )

xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.

set-recs ( xt_u ... xt_1 u xt -- )

xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched first, and xt_u is searched last. Throw ... if u exceeds the number of elements supported by the recognizer sequence.

Rationale

(This will also be fleshed out)

The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, and such words are only used internally in the text interpreter.

However, to give an idea of how all this works together, here are the words that Gforth provides for that purpose:

interpreting ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.

compiling ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.

postponing ( ... translation -- )

For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.

Examples

s" 123" rec-forth ( translation ) interpreting

: rec-tick ( addr u -- translation )
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret ( xt ) translate-num then
        exit then
    \ 2drop notfound
    rec-none ;

' noop                       ( x -- x )                             \ int-xt
' lit,                       ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ;  run-time: -- x ) \ post-xt
translate: translate-num

EricBlake

I'm aware you intend to flesh this out further, but when doing so, the following observations may be useful.

Examples

 s" 123" rec-forth ( translation ) interpreting

It might be helpful to also show the stack effect after interpreting, as in:

s" 123" rec-forth ( translation ) interpreting ( n ) \ leaves 123 on the stack

If I understand the intent correctly, the difference between rec-none and translate-none is that both produce the same translation (which consists of a translation token and possibly additional cells), while only the former also consumes addr u before pushing that translation. Taking it further, since translation is a semi-opaque type of one or more cells, I think I can still implement it where the translation token returned by translate-none is a literal 0 (trivially as 0 constant translate-none), provided that my interpreting/compiling/postponing all recognize a literal 0 as the built-in translation whose effects are to result in -13 throw. Or I could implement :noname -13 throw ; dup dup translate: translate-none where the translation token is a non-zero xt just like any other user-defined word created by translate:, and then interpreting/compiling/postponing don't have to special-case 0. But either way, my implementation choice for translate-none is not unduly constrained by the standard, and not relevant or visible to the user; but it DOES place constraints on the user writing their own recognizers to use rec-none or translate-none instead of hard-coding 0 in their code, if they don't want an environmental dependency on the implementation. I think I like how that turned out.

But given that analysis, it means that the standard should not prohibit a translation token from being a literal 0; when compared to the other recent work in r1533 to designate that xt => x \ flag, the standard should be clear that translation token => x and not translation token => xt.

Continuing with the examples,

: rec-tick ( addr u -- translation )
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret ( xt ) translate-num then

Why is this calling out translate-num instead of translate-cell?

        exit then

This exit leaves the stack with either xt translate-num or 0; the former makes sense but the latter assumes translate-none produces a literal 0, which I just argued above is a specific implementation choice rather than something the standard mandates. Would it be better as:

... find-name ?dup if name>interpret ( xt ) translate-cell exit then translate-none exit ...

    \ 2drop notfound
    rec-none ;

I like how rec-none serves the same role as the former 2drop notfound, but does leaving in the comment aid in understanding the example?

' noop                       ( x -- x )                             \ int-xt
' lit,                       ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ;  run-time: -- x ) \ post-xt
translate: translate-num

Is this part of the example intended to be a potential reference implementation of translate-cell? Or is it intended to supply the translate-num used in the rec-tick above, in which case the order of presentation should be swapped?

ruv

translate-cell ( x -- translation )
translate-name ( nt -- translation )

The naming scheme translate-*** is inappropriate and confusing for these words; for example, the name translate-name implies that the word performs some translation, but this word actually does not perform any translation, it is just a constant (i.e., it simply pushes a single-cell token on the stack; and this should be indicated in its stack diagram).

We should find a better naming scheme for these words.

Possible options:

  • ***-recognized
  • ***-tag or tag-*** (because effectively this value is a data type tag for a data object)
  • td-*** (from "token discriminator" or "token descriptor", similar to tag)

Other?


@EricBlake wrote:

I think I can still implement it where the translation token returned by translate-none is a literal 0

Yes, for example, in Gforth it is currently implemented this way.

A zero value on failure simplifies analysis: you can write «dup if ...» or «if ...» instead of «dup tag-none <> if ...». If most implementations stick to this approach, it can be standardized.

Why is this calling out translate-num instead of translate-cell?

This is probably an oversight after renamings.

EricBlake

it DOES place constraints on the user writing their own recognizers to use rec-none or translate-none instead of hard-coding 0 in their code, if they don't want an environmental dependency on the implementation. I think I like how that turned out.

But now in trying to code something up that uses recognizers from the user's point of view, I tried to write a quick word that accepts single-cell numbers but rejects character strings that would produce a double cell or not be recognized as a number. Using gforth's implementation, it might be as simple as:

: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
  rec-number
  case
    translate-cell of ( n ) true exit endof
    translate-none of ( -- ) 0 exit endof
    translate-dcell of ( d ) 2drop 0 exit endof
    abort" unexpected translation" endcase ;

But re-reading the proposed specification, translate-cell has a mandated stack effect of ( n -- translation ), and my usage above did not satisfy that requirement. So, what if I modify it along these lines, to ensure that every time translate-cell is used, there is an n on the stack beforehand (and then jump through hoops to get it back off the stack)?

: token-none ( -- token ) \ determine the token produced by translate-none
  translate-none
;
: token-cell ( -- token ) \ determine the token produced by translate-cell
  0 translate-cell dup >r interpreting drop r>
;
: token-dcell ( -- token ) \ determine the token produced by translate-dcell
  #0. translate-dcell dup >r interpreting 2drop r>
;
: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
  rec-number
  case
    token-cell of ( n ) true exit endof
    token-none of ( -- ) 0 exit endof
    token-dcell of ( d ) 2drop 0 exit endof
    abort" unexpected translation" endcase ;

Alas, even that seems like it is not portable to the proposed spec, but has an environmental dependency on gforth's implementation. There's nothing in the proposed wording that requires the translation token cell to be identical regardless of what the rest of the overall translation represents. In fact, I see nothing that requires translate-none to produce a single cell, nor for any other given flavor of translation to occupy a consistent number of cells regardless of the value being translated. The proposal is very clear that 123 translate-cell interpreting will leave 123 on the stack, but intentionally does not state how many cells of the stack are in use in between translate-cell and interpreting (only that it is a semi-opaque type of one or more cells).

Put differently, gforth's implementation for translate-cell happens to be idempotent and produce a translation that occupies two cells (namely, the value of the cell being translated, and a single-cell translation token); but based on just the proposed spec, what would prevent an alternative implementation that has a translation occupy exactly one cell (namely, a pointer to an internal struct that wraps multiple pieces of information, including the value to push, the current line/offset of the source at the time the call to translate-cell was made in order to make for more friendly SEE output, and so on)? With such an implementation, I could argue that 0 translate-cell 0 translate-cell = producing false is acceptable, because it results in two different pointers (there were two different source locations at the time of the two different invocations of translate-cell). Or what would prevent an implementation where 0 translate-cell occupies one cell, because it is frequently encountered and worth special-casing in the interpreter loop, vs. 123 translate-cell occupying two cells, because it is infrequently encountered?

From the user's point of view, it would be a lot more powerful if we had a guarantee that a given translate-XXX produces the same translation token at the top of the stack (even if the rest of the stack is variable-length), and that a given rec-XXX produces idempotent output (maybe with limitations on how SOURCE and >IN can be changed between the recognizer and the action on the translation). It would also be nice if we could guarantee that ALL instances of translate-XXX have the behavior of pushing a single cell to the stack, where that cell is constant for a given translate-XXX, and document that comparing translation tokens is well-defined, and that the rest of the stack diagram for a translator only matters if the resulting translation will be further passed to interpreting/compiling/postponing (i.e., translate-cell drop is always unambiguous and cannot cause stack underflow, but it is ambiguous behavior to attempt translate-cell compiling if there was not an n on the stack).

Finally, it would also be nice if there were an easy way to discard the entire stack effect of a given translation, if the result of a recognizer produces a different translation than desired. Maybe discard ( translation -- ), so that I could rewrite my earlier example more compactly:

: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
  rec-number
  dup translate-cell = if ( n token-cell ) drop true exit then
  ( 0 | d token-dcell -- ) discard 0 ;

ruv

But re-reading the proposed specification, translate-cell has a mandated stack effect of ( n -- translation )

The idea is that translate-cell has stack effect ( -- x.td ), and ( x x.td ) is a translation. The name translate-cell with its stack diagram is very confusing. I think this proposal needed more work before publication.

See also my suggestion for naming these data types.
