Proposal: Recognizer committee proposal 2025-09-11

Informal

AntonErtl [412] Recognizer committee proposal 2025-09-11 (Proposal, 2025-09-12 03:56:07)

The committee has found consensus on the following words. I was asked to write it up as a proposal, quickly. Due to time limits this is just a skeleton, and will not make sense to people new to the discussion. A more fleshed-out proposal will be submitted later.

Translations and text-interpretation

Recognizers produce translations. The text interpreter (and other users, such as postpone) removes the translation from the stack(s), and then performs either the interpreting run-time, the compiling run-time, or the postponing run-time.

Unless otherwise specified the compiling run-time compiles the interpreting run-time. The postponing run-time compiles the compiling run-time.

Types

translation: The result of a recognizer; the input of interpreting, compiling, and postponing; it's a semi-opaque type that consists of a translation token at the top of the data stack and additional data on various stacks below.

translation token: Single-cell item that identifies a certain translation. (This has formerly been called a rectype.)

Words

rec-name ( c-addr u -- translation )

If c-addr u is the name of a visible local or a visible named word, translation represents the text-interpretation semantics (interpreting, compiling, postponing) of that word (see translate-name). (formerly called rec-nt). If not, translation is translate-none.

rec-number ( c-addr u -- translation )

If c-addr u is a single or double number (without or with prefix), or a character, all as described in section ..., translation represents pushing that number at run-time (see translate-cell, translate-dcell). If not, translation is translate-none.

rec-float ( c-addr u -- translation )

If c-addr u is a floating-point number, as described in section ..., translation represents pushing that number at run-time (see translate-float). If c-addr u has the syntax of a double number without prefix according to section ..., and it corresponds to the floating-point number r corresponding to that string according to section ..., translation may represent pushing r at run-time. If c-addr u is not recognized as a floating-point number, translation is translate-none.

rec-none ( c-addr u -- translation )

This word does not recognize anything. For its translation, see translate-none. (formerly known as notfound and r:fail)

recs ( -- )

Print the recognizers in the recognizer sequence in rec-forth, the first searched recognizer leftmost. (formerly known as .recognizers)

rec-forth ( c-addr u -- translation )

This is a deferred word that contains the recognizer (sequence) that is used by the Forth text interpreter. (formerly forth-recognize)

rec-sequence: ( xtu .. xt1 u "name" -- )

Define a recognizer sequence "name" containing u recognizers represented by their xts. If set-recs is implemented, the sequence must be able to accommodate at least 16 recognizers.

name execution: ( c-addr u -- translation )

Execute xt1; if the resulting translation is the result of translate-none, restore the data stack to ( c-addr u -- ) and try the next xt. If there is no next xt, remove ( c-addr u -- ) and perform translate-none.
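
The search order described above can be modelled in standard Forth. This is only a sketch under stated assumptions, not the committee's wording or Gforth's implementation: all names (my-translate-none, try-recs, my-rec-sequence:, tok-a, tok-b, rec-a, rec-b, my-recs) are hypothetical, a failed translation is assumed to be the single cell 0, and the string is held in locals so that the data stack never has to be restored.

```forth
\ Model only: every name defined here is hypothetical.
0 constant my-translate-none        \ assumption: failure is the single cell 0

: try-recs {: c-addr u a n -- translation :}
  \ a is the address of the first stored xt, n the number of xts
  n 0 ?do
    c-addr u  a i cells + @ execute   \ run the i-th recognizer on the string
    dup my-translate-none <> if unloop exit then   \ success: keep its result
    drop                              \ failure: discard the one-cell result
  loop
  my-translate-none ;                 \ nothing matched

: my-rec-sequence: ( xtu .. xt1 u "name" -- )
  create dup ,  0 ?do , loop          \ body: u, xt1, xt2, ... xtu
  does> ( c-addr u body -- translation )
    dup cell+ swap @ try-recs ;

\ Two toy recognizers whose translations are bare tokens without data.
111 constant tok-a   222 constant tok-b
: rec-a ( c-addr u -- translation )
  s" a" compare 0= if tok-a else my-translate-none then ;
: rec-b ( c-addr u -- translation )
  s" b" compare 0= if tok-b else my-translate-none then ;

' rec-b ' rec-a 2 my-rec-sequence: my-recs   \ rec-a is searched first
```

With these definitions, s" b" my-recs first tries rec-a, which fails, and then returns tok-b from rec-b. A string that neither recognizer accepts yields my-translate-none.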

translate-none ( -- translation )

(formerly r:fail or notfound)

translation interpreting run-time: ( ... -- )

-13 throw

translation compiling run-time: ( ... -- )

-13 throw

translation postponing run-time: ( ... -- )

-13 throw

translate-cell ( x -- translation )

(formerly translate-num)

translation interpreting run-time: ( -- x )

translate-dcell ( xd -- translation )

(formerly translate-dnum)

translation interpreting run-time: ( -- xd )

translate-float ( r -- translation )

translation interpreting run-time: ( -- r )

translate-name ( nt -- translation )

(formerly translate-nt)

translation interpreting run-time: ( ... -- ... )

Perform the interpretation semantics of nt.

translation compiling run-time: ( ... -- ... )

Perform the compilation semantics of nt.

translate: ( xt-int xt-comp xt-post "name" -- )

Define "name" (formerly rectype:)

"name" execution: ( i*x -- translation )

translation interpreting run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-int.

translation compiling run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-comp.

translation postponing run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-post.

get-recs ( xt -- xt_u ... xt_1 u )

xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.

set-recs ( xt_u ... xt_1 u xt -- )

xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched first, and xt_u is searched last. Throw ... if u exceeds the number of elements supported by the recognizer sequence.

Rationale

(This will also be fleshed out)

The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, and such words are only used internally in the text interpreter.

However, to give an idea how all this works together, here's the words that Gforth provides for that purpose:

interpreting ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.

compiling ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.

postponing ( ... translation -- )

For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.
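
For user-defined translation tokens, the three words above can be modelled in standard Forth. This is a sketch under assumptions, not Gforth's code: every name starting with my- is hypothetical, and the translation token is assumed to be the address of a three-cell table holding int-xt, comp-xt and post-xt.

```forth
\ Model only: every name starting with my- is hypothetical.
: my-translate: ( xt-int xt-comp xt-post "name" -- )
  create rot , swap , , ;             \ table layout: int, comp, post
\ "name" execution: ( -- token ), the address of the table

: my-interpreting ( ... token -- ... )           @ execute ;
: my-compiling    ( ... token -- ... ) cell+     @ execute ;
: my-postponing   ( ... token -- )     2 cells + @ execute ;

: my-lit, ( x -- ) postpone literal ; \ compile x as a literal

:noname ;                             \ int-xt:  ( x -- x )
' my-lit,                             \ comp-xt: ( x -- ) run-time: ( -- x )
:noname my-lit, postpone my-lit, ;    \ post-xt: ( x -- )
my-translate: my-translate-cell

\ Compiling: the literal 5 is compiled into FIVE.
: five ( -- 5 ) [ 5 my-translate-cell my-compiling ] ;

\ Postponing: SEVEN, compiles the literal 7 into whatever is being defined.
: seven, [ 7 my-translate-cell my-postponing ] ; immediate
: seven ( -- 7 ) seven, ;
```

In this model, 123 my-translate-cell my-interpreting leaves 123 on the stack, five pushes 5, and seven pushes 7.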

Examples

s" 123" rec-forth ( translation ) interpreting

: rec-tick ( addr u -- translation )
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret ( xt ) translate-num then
        exit then
    \ 2drop notfound
    rec-none ;

' noop                       ( x -- x )                             \ int-xt
' lit,                       ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ;  run-time: -- x ) \ post-xt
translate: translate-num

AntonErtl New Version: Recognizer committee proposal 2025-09-11

The committee has found consensus on the following words. I was asked to write it up as a proposal, quickly. Due to time limits this is just a skeleton, and will not make sense to people new to the discussion. A more fleshed-out proposal will be submitted later.

Translations and text-interpretation

Recognizers produce translations. The text interpreter (and other users, such as postpone) removes the translation from the stack(s), and then performs either the interpreting run-time, the compiling run-time, or the postponing run-time.

Unless otherwise specified the compiling run-time compiles the interpreting run-time. The postponing run-time compiles the compiling run-time.

Types

translation: The result of a recognizer; the input of interpreting, compiling, and postponing; it's a semi-opaque type that consists of a translation token at the top of the data stack and additional data on various stacks below.

translation token: Single-cell item that identifies a certain translation. (This has formerly been called a rectype.)

Words

rec-name ( c-addr u -- translation )

If c-addr u is the name of a visible local or a visible named word, translation represents the text-interpretation semantics (interpreting, compiling, postponing) of that word (see translate-name). If not, translation is translate-none. (formerly called rec-nt)

rec-number ( c-addr u -- translation )

If c-addr u is a single or double number (without or with prefix), or a character, all as described in section ..., translation represents pushing that number at run-time (see translate-cell, translate-dcell). If not, translation is translate-none.

rec-float ( c-addr u -- translation )

If c-addr u is a floating-point number, as described in section ..., translation represents pushing that number at run-time (see translate-float). If c-addr u has the syntax of a double number without prefix according to section ..., and it corresponds to the floating-point number r corresponding to that string according to section ..., translation may represent pushing r at run-time. If c-addr u is not recognized as a floating-point number, translation is translate-none.

rec-none ( c-addr u -- translation )

This word does not recognize anything. For its translation, see translate-none.

recs ( -- )

Print the recognizers in the recognizer sequence in rec-forth, the first searched recognizer leftmost. (formerly known as .recognizers)

rec-forth ( c-addr u -- translation )

This is a deferred word that contains the recognizer (sequence) that is used by the Forth text interpreter. (formerly forth-recognize)

rec-sequence: ( xtu .. xt1 u "name" -- )

Define a recognizer sequence "name" containing u recognizers represented by their xts. If set-recs is implemented, the sequence must be able to accommodate at least 16 recognizers.

name execution: ( c-addr u -- translation )

Execute xt1; if the resulting translation is the result of translate-none, restore the data stack to ( c-addr u -- ) and try the next xt. If there is no next xt, remove ( c-addr u -- ) and perform translate-none.

translate-none ( -- translation )

(formerly r:fail or notfound)

translation interpreting run-time: ( ... -- )

-13 throw

translation compiling run-time: ( ... -- )

-13 throw

translation postponing run-time: ( ... -- )

-13 throw

translate-cell ( x -- translation )

(formerly translate-num)

translation interpreting run-time: ( -- x )

translate-dcell ( xd -- translation )

(formerly translate-dnum)

translation interpreting run-time: ( -- xd )

translate-float ( r -- translation )

translation interpreting run-time: ( -- r )

translate-name ( nt -- translation )

(formerly translate-nt)

translation interpreting run-time: ( ... -- ... )

Perform the interpretation semantics of nt.

translation compiling run-time: ( ... -- ... )

Perform the compilation semantics of nt.

translate: ( xt-int xt-comp xt-post "name" -- )

Define "name" (formerly rectype:)

"name" execution: ( i*x -- translation )

translation interpreting run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-int.

translation compiling run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-comp.

translation postponing run-time: ( ... translation -- ... )

Remove the top of stack (the translation token) and execute xt-post.

get-recs ( xt -- xt_u ... xt_1 u )

xt is the execution token of a recognizer sequence. xt_1 is the first recognizer searched by this sequence, xt_u is the last one.

set-recs ( xt_u ... xt_1 u xt -- )

xt is the execution token of a recognizer sequence. Replace the contents of this sequence with xt_u ... xt_1, where xt_1 is searched first, and xt_u is searched last. Throw ... if u exceeds the number of elements supported by the recognizer sequence.

Rationale

(This will also be fleshed out)

The committee has decided not to standardize words that consume translations for now. Such words would be useful for defining a user-defined text interpreter, but the experience with recognizers has shown that a recognizer-using text interpreter is flexible enough that it is no longer necessary to write such text interpreters, and such words are only used internally in the text interpreter.

However, to give an idea how all this works together, here's the words that Gforth provides for that purpose:

interpreting ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stack(s), then perform the interpreting run-time specified for the translation token. For a user-defined translation token, remove it from the stack and execute its int-xt.

compiling ( ... translation -- ... )

For a system-defined translation token, first remove the translation from the stacks, then perform the compiling run-time specified for the translation token, or, if none is specified, compile the 'interpreting' run-time. For a user-defined translation token, remove it from the stack and execute its comp-xt.

postponing ( ... translation -- )

For a system-defined translation token, first consume the translation, then compile the 'compiling' run-time. For a user-defined translation token, remove it from the stack and execute its post-xt.

Examples

s" 123" rec-forth ( translation ) interpreting

: rec-tick ( addr u -- translation )
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret ( xt ) translate-num then
        exit then
    \ 2drop notfound
    rec-none ;

' noop                       ( x -- x )                             \ int-xt
' lit,                       ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ;  run-time: -- x ) \ post-xt
translate: translate-num

EricBlake

I'm aware you intend to flesh this out further, but when doing so, the following observations may be useful.

Examples

 s" 123" rec-forth ( translation ) interpreting

It might be helpful to also show the stack effect after interpreting, as in:

s" 123" rec-forth ( translation ) interpreting ( n ) \ leaves 123 on the stack

If I understand the intent correctly, the difference between rec-none and translate-none is that both produce the same translation (which consists of a translation token and possibly additional cells), while only the former also consumes addr u before pushing that translation. Taking it further, since translation is a semi-opaque type of one or more cells, I think I can still implement it where the translation token returned by translate-none is a literal 0 (trivially as 0 constant translate-none), provided that my interpreting/compiling/postponing all recognize a literal 0 as the built-in translation whose effects are to result in -13 throw. Or I could implement :noname -13 throw ; dup dup translate: translate-none where the translation token is a non-zero xt just like any other user-defined word created by translate:, and then interpreting/compiling/postponing don't have to special-case 0. But either way, my implementation choice for translate-none is not unduly constrained by the standard, and not relevant or visible to the user; but it DOES place constraints on the user writing their own recognizers to use rec-none or translate-none instead of hard-coding 0 in their code, if they don't want an environmental dependency on the implementation. I think I like how that turned out.

But given that analysis, it means that the standard should not prohibit a translation token from being a literal 0; when compared to the other recent work in r1533 to designate that xt => x \ flag, the standard should be clear that translation token => x and not translation token => xt.

Continuing with the examples,

: rec-tick ( addr u -- translation )
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret ( xt ) translate-num then

Why is this calling out translate-num instead of translate-cell?

        exit then

This exit leaves the stack with either xt translate-num or 0; the former makes sense but the latter assumes translate-none produces a literal 0, which I just argued above is a specific implementation choice rather than something the standard mandates. Would it be better as:

... find-name ?dup if name>interpret ( xt ) translate-cell exit then translate-none exit ...
    \ 2drop notfound
    rec-none ;

I like how rec-none serves the same role as the former 2drop notfound, but does leaving in the comment aid in understanding the example?

' noop                       ( x -- x )                             \ int-xt
' lit,                       ( compilation: x -- ; run-time: -- x ) \ comp-xt
:noname lit, postpone lit, ; ( postponing: x -- ;  run-time: -- x ) \ post-xt
translate: translate-num

Is this part of the example intended to be a potential reference implementation of translate-cell? Or is it intended to supply the translate-num used in the rec-tick above, in which case the order of presentation should be swapped?

ruv

translate-cell ( x -- translation )
translate-name ( nt -- translation )

The naming scheme translate-*** is inappropriate and confusing for these words; for example, the name translate-name implies that the word performs some translation, but this word actually does not perform any translation, it is just a constant (i.e., it simply pushes a single-cell token on the stack; and this should be indicated in its stack diagram).

We should find a better naming scheme for these words.

Possible options:

  • ***-recognized
  • ***-tag or tag-*** (because effectively this value is a data type tag for a data object)
  • td-*** (from "token discriminator" or "token descriptor", similar to tag)

Other?


@EricBlake wrote:

I think I can still implement it where the translation token returned by translate-none is a literal 0

Yes, for example, in Gforth it is currently implemented this way.

A zero value on failure simplifies analysis: you can do «dup if ...» or «if ...» instead of «dup tag-none <> if ... ». If most implementations stick to this approach, it can be standardized.

Why is this calling out translate-num instead of translate-cell?

This is probably an oversight after renamings.

EricBlake

it DOES place constraints on the user writing their own recognizers to use rec-none or translate-none instead of hard-coding 0 in their code, if they don't want an environmental dependency on the implementation. I think I like how that turned out.

But now in trying to code something up that uses recognizers from the user's point of view, I tried to write a quick word that accepts single-cell numbers but rejects character strings that would produce a double cell or not be recognized as a number. Using gforth's implementation, it might be as simple as:

: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
  rec-number
  case
    translate-cell of ( n ) true exit endof
    translate-none of ( -- ) 0 exit endof
    translate-dcell of ( d ) 2drop 0 exit endof
    abort" unexpected translation" endcase ;

But re-reading the proposed specification, translate-cell has a mandated stack effect of ( n -- translation ), and my usage above did not satisfy that requirement. So, what if I modify it along these lines, to ensure that every time translate-cell is used, there is an n on the stack beforehand (and then jump through hoops to get it back off the stack)?

: token-none ( -- token ) \ determine the token produced by translate-none
  translate-none
;
: token-cell ( -- token ) \ determine the token produced by translate-cell
  0 translate-cell dup >r interpreting drop r>
;
: token-dcell ( -- token ) \ determine the token produced by translate-dcell
  #0. translate-dcell dup >r interpreting 2drop r>
;
: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
  rec-number
  case
    token-cell of ( n ) true exit endof
    token-none of ( -- ) 0 exit endof
    token-dcell of ( d ) 2drop 0 exit endof
    abort" unexpected translation" endcase ;

Alas, even that seems like it is not portable to the proposed spec, but has an environmental dependency on gforth's implementation. There's nothing in the proposed wording that requires the translation token cell to be identical regardless of what the rest of the overall translation represents. In fact, I see nothing that requires translate-none to produce a single cell, nor for any other given flavor of translation to occupy a consistent number of cells regardless of the value being translated. The proposal is very clear that 123 translate-cell interpreting will leave 123 on the stack, but intentionally does not state how many cells of the stack are in use in between translate-cell and interpreting (only that it is a semi-opaque type of one or more cells).

Put differently, gforth's implementation for translate-cell happens to be idempotent and produce a translation that occupies two cells (namely, the value of the cell being translated, and a single-cell translation token); but based on just the proposed spec, what would prevent an alternative implementation that has a translation occupy exactly one cell (namely, a pointer to an internal struct that wraps multiple pieces of information, including the value to push, the current line/offset of the source at the time the call to translate-cell was made in order to make for more friendly SEE output, and so on)? With such an implementation, I could argue that 0 translate-cell 0 translate-cell = producing false is acceptable, because it results in two different pointers (there were two different source locations at the time of the two different invocations of translate-cell). Or what would prevent an implementation where 0 translate-cell occupies one cell, because it is frequently encountered and worth special-casing in the interpreter loop, vs. 123 translate-cell occupying two cells, because it is infrequently encountered?

From the user's point of view, it would be a lot more powerful if we had a guarantee that a given translate-XXX produces the same translation token at the top of the stack (even if the rest of the stack is variable-length), and that a given rec-XXX produces idempotent output (maybe with limitations on how SOURCE and >IN can be changed between the recognizer and the action on the translation). It would also be nice if we could guarantee that ALL instances of translate-XXX have the behavior of pushing a single cell to the stack, where that cell is constant for a given translate-XXX, and document that comparing translation tokens is well-defined, and that the rest of the stack diagram for a translator only matters if the resulting translation will be further passed to interpreting/compiling/postponing (i.e., translate-cell drop is always unambiguous and cannot cause stack underflow, but it is ambiguous behavior to attempt translate-cell compiling if there was not an n on the stack).
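
To illustrate what such a guarantee would allow, here is a self-contained model in standard Forth. It is a sketch under assumptions, not the proposal's rec-number: the tokens tok-none, tok-cell and tok-dcell are hypothetical constants, and model-rec-number accepts only a toy syntax (unsigned digits in the current base, with a trailing "." marking a double).

```forth
\ Model only: every name defined here is hypothetical.
0 constant tok-none   1 constant tok-cell   2 constant tok-dcell

: model-rec-number {: c-addr u -- translation :}
  \ result: ( n tok-cell ), ( d tok-dcell ) or ( tok-none )
  0 0 c-addr u >number                      ( ud c-addr' u' )
  dup u = if 2drop 2drop tok-none exit then \ no digit converted
  dup 0= if 2drop drop tok-cell exit then   \ only digits: single cell
  1 = swap c@ [char] . = and                ( ud flag )
  if tok-dcell exit then                    \ digits plus final "."
  2drop tok-none ;

: single? ( c-addr u -- n true | false )
  model-rec-number
  dup tok-cell  = if drop true exit then        \ keep n, report success
  dup tok-dcell = if drop 2drop false exit then \ discard the double
  drop false ;                                  \ tok-none is one cell here
```

Because each token is a constant in this model, comparing the top cell with = is well-defined, and the data below it can be dropped by hand once the token is known.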

Finally, it would also be nice if there were an easy way to discard the entire stack effect of a given translation, if the result of a recognizer produces a different translation than desired. Maybe discard ( translation -- ), so that I could rewrite my earlier example more compactly:

: single? ( c-addr u -- n true | 0 ) \ recognize only a single-cell integer, with a flag rather than translator on success
  rec-number
  dup translate-cell = if ( n token-cell ) drop true exit then
  ( 0 | d token-dcell -- ) discard 0 ;

ruv

But re-reading the proposed specification, translate-cell has a mandated stack effect of ( n -- translation )

The idea is that translate-cell has stack effect ( -- x.td ), and ( x x.td ) is a translation. The name translate-cell with its stack diagram is very confusing. I think this proposal needed more work before publication.

See also my suggestion for naming these data types.

ruv

it would also be nice if there were an easy way to discard the entire stack effect of a given translation, if the result of a recognizer produces a different translation than desired. Maybe discard ( translation -- )

Yes, it would be useful and convenient. But then we should associate more information with data type identifiers. I think I would implement this.

In this proposal, "translation" (yes, the name is inappropriate) in stack diagrams is a data type symbol for the data type that is a subtype of ( ut td ), which is a pair of ut (unqualified token) and td (token descriptor) at the top, where:

  • td => x, i.e., the token descriptor (a data type) is a subtype of the unspecified cell;
  • ut => ( F: j*r ; S: i*x ; ), i.e., the unqualified token (a data type) is an arbitrary tuple (possibly empty) of r and x values, so that r values reside on the floating point stack, x values reside on the data stack.

In the language of type theories, ( ut td ) is a dependent pair type, since each member of td is associated with some particular subtype of ut. In other words, the value of the stack parameter td determines the type of the stack parameter ut. Therefore, a member of the type ( ut td ) can be interpreted as a tagged data object, in which the value td is a tag for the value ut.

Thus, each member of td is also an identifier of some data type. The Forth system associates with this identifier information about how to translate (interpret or compile) the members of the corresponding data type (a subtype of ut). Of course, this identifier may also be associated with information about how to remove the members of this data type from the stacks.

Since the user can effectively define their own data types, we should provide a way to create a token descriptor (a member of td) and associate various information with it in several steps. The main piece of information is about translation of the data type members. Information about postponing and discarding (removing from the stacks) may be optional.

Regarding terminology/naming. We can use the term "data type identifier" instead of "token descriptor", but 1) this name is longer, 2) then there will be a number of terms that look very similar: "data type", "data type symbol", "data type identifier". Therefore, I would prefer more distinguishable terms.

As an option, instead of "token descriptor" we can use "type descriptor".


An alternative solution to remove a qualified token from the stack is to determine its size. This can be done by storing the stack depth before the qualified token is placed on the stack and then calculating the change in stack depth. For example, see the word apply-recognizer-filter in my recognizer/filter.fth and its use in the word available-xt in example.text-translator.fth.
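
The depth-based idea can be sketched in standard Forth as follows. This is not the code from recognizer/filter.fth; discard-to-depth and demo are hypothetical names, and the sketch covers only the data stack (items on the floating-point stack would need the same treatment with fdepth).

```forth
\ Model only: discard-to-depth and demo are hypothetical names.
: discard-to-depth ( i*x depth0 -- j*x )
  \ drop data-stack items until DEPTH is back to depth0
  >r begin depth r@ > while drop repeat r> drop ;

: demo ( -- flag )
  depth >r                  \ remember the depth before the result arrives
  10 20 30                  \ stand-in for a qualified token of unknown size
  r@ discard-to-depth       \ remove whatever was pushed
  depth r> = ;              \ true: the stack is back at its old depth
```

The depth has to be recorded before the string is passed to the recognizer, because the recognizer consumes c-addr u and replaces it with a result of unknown size.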

AntonErtl

The mentions of translate-num in the examples are oversights and should be translate-cell. The way that the example rec-tick deals with the case where the word is not found does not work with a non-zero translate-none. A correct implementation is:

: rec-tick ( addr u -- translation ) \ gforth-experimental
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret translate-cell exit then
        drop translate-none exit then
    rec-none ;

EricBlake

: rec-tick ( addr u -- translation ) \ gforth-experimental
   over c@ '`' = if

The \ gforth-experimental comment can be dropped. Is a recognizer guaranteed that u will be non-zero, or is this c@ at risk of reading beyond the bounds of the input argument? And if the recognizer is called on the length-1 string "`", should this example be relying on the implementation-defined results of c-addr 0 find-name (likely 0, but possibly an xt if the implementation allows for an empty-length dictionary entry)?

ruv

I think we should fix the following problems.

The term "translation"

The term translation is not suitable to denote the general type of a recognizer's result, since "translation" is either an act of translating or a product of translating (not of recognizing). Even the term "recognition" is more suitable, if someone likes it.

Another possible option: "recognized", which will be used as a nominalized adjective (i.e., a noun).

We also need a separate term to denote the type of the topmost x value of a successful recognizing result.

The scheme translate-something

The naming scheme translate-something is not suitable for words that have type ( -- x ) and are constants.

  • Effectively, any member of this naming scheme is a verb phrase; this scheme was intended for words that perform translation (interpretation or compilation), which is an active action with possible side effects.
    • For example, translate-nt ( i*x nt -- j*x ).

A word that is a constant should have the name that is a noun or a noun phrase.

This naming scheme should be aligned with the corresponding general data type name/symbol.

The names get-recs and set-recs

The pair of words ( get-recs, set-recs ) is similar to the pair of standard words ( get-order, set-order ) by the form of their names, but very different conceptually, since they accept the object on the top. This is an inconsistency in naming conventions.

Better naming options are:

  • recs@ and recs!
    • "recs" in these names denotes the pair of types at once
      • the type of a data object that is fetched or stored
      • the type of a target data object
    • it is also similar to some extent to the pairs of standard words (defer@, defer!), (c@, c!), (2@, 2!)
      • see also my post on ForthHub in this regard.
  • fetch-recs and store-recs

The names translate: and rec-sequence:

The corresponding words are proposed as defining words.

Traditionally, a colon was only used in the names of standard defining words that have a counterpart word with a semicolon in the name. So, this name is inconsistent with other names. Note that this tradition was broken by the new "*field:" words (but not +field).

  • Can we avoid a colon in the defining words that don't have a counterpart word with a semicolon?

The name rec-sequence: is too close to rec-sequence that is a member of the rec-something naming scheme. This is inconsistent and confusing.

  • A possible option: recs — an abbreviation of "recognizers sequence", which is "sequence of recognizers".
    • Maybe it is better if this word was like wordlist, which produces a new identifier on the stack without creating a word.

ruv

Erratum:

see also my post on ForthHub in this regard.

The correct link

Josef

I agree with @ruv that "translation" doesn't quite fit and finding suitable terms is a real challenge. I utilized this proposal, BerndPaysan's retired recognizer proposal, FORTH Inc.'s recognizer page, and the comments here and on the mailing list.

Suggestions short summary

Remove the "translation" term because it obfuscates the possible outputs, explained below.

Because a "translation token" is a table of run-time actions, "run-time action table" (rat) would seem appropriate, explained below.

A recognizer definition is proposed below.

It doesn't seem that the connection between a recognizer's pattern and the rest of the steps is really discussed. Matching a text token to the pattern is the first step. The parameters are then fetched according to the recognizer's pattern. The run-time action table is associated with a specific pattern parameter.

Recognizer term

From this proposal and FORTH, Inc.'s write up, the following seems to be how to design a recognizer:

  1. Determine the text pattern of the data.
    • E.g. complex numbers follow an "a+bi" pattern.
  2. Create a pattern matching algorithm for the text pattern.
  3. Determine pattern parameters to be fetched.
  4. Determine the run-time action tables to pair with the fetched pattern parameters.

Recognizer definition proposal: A recognizer attempts to match a text token to a pattern. A successful text token match invokes fetching the pattern parameters and the associated run-time action table. A failed matching attempt outputs a rat-none. The text interpreter (and other users, such as postpone) utilizes the run-time action table to perform either the interpreting run-time, compiling run-time, or postponing run-time.

rec-sequence: → make-rec-sequence ( xtu .. xt1 u "name" -- )

rec-name ( c-addr u -- xt rat | rat-none)

rec-num ( c-addr u -- i*x rat | rat-none)

rec-none ( c-addr u -- rat-none )

I agree with @ruv's suggestions for get-recs and set-recs, i.e. recs@ & recs!.
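To make the behaviour of make-rec-sequence concrete, here is a minimal sketch in standard Forth. It is not taken from the proposal or from any existing system. It assumes that rat-none is a distinguished single-cell value (0 here, purely for brevity), that xt1 is tried first, and that a recognizer that fails leaves only rat-none on the stack. The words rat-char, rec-single, and rec-demo are hypothetical and exist only for the demonstration.

```forth
\ Assumption: rat-none is a distinguished cell; 0 keeps the sketch short.
0 constant rat-none

: rec-none ( c-addr u -- rat-none )  2drop rat-none ;

\ Store the count, then xt1 .. xtu; at run time try them in that order.
: make-rec-sequence ( xtu .. xt1 u "name" -- )
    create  dup ,  0 ?do , loop
  does> ( c-addr u body -- i*x rat | rat-none )
    dup cell+ swap @ cells over + swap      ( c-addr u end start )
    ?do                                     ( c-addr u )
        i @ rot rot  2dup 2>r  rot execute  ( i*x rat ) ( R: c-addr u )
        dup rat-none <> if  2r> 2drop  unloop exit  then
        drop 2r>                            ( c-addr u )
    1 cells +loop
    2drop rat-none ;

\ Hypothetical demo: a stand-in rat and a recognizer for one-character tokens.
create rat-char
: rec-single ( c-addr u -- char rat-char | rat-none )
    1 = if  c@ rat-char  else  drop rat-none  then ;

' rec-none ' rec-single 2 make-rec-sequence rec-demo
```

With these definitions, rec-demo applied to a one-character string leaves the character and rat-char, and applied to any other string it leaves rat-none. The token string is kept on the return stack while each recognizer runs, so a recognizer may leave any number of pattern parameters below its rat.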

Translation Term

Translation seems to hide information. The relationship between the pattern parameters and the run-time action table is fixed. Because different recognizers produce different outputs, using "translation" as a catchall obscures the output, rather than listing the output i*x rat, xt rat, etc.

translation token → run-time action table: Single-cell item that contains the run-time actions associated with specific pattern parameters, i.e. interpreting run-time, compiling run-time, and postponing run-time. (This has formerly been called a rectype, translation token. It's a table of run-time actions.)

translate: → make-rat ( xt-int xt-comp xt-post "name" -- )

translate-word → rat-word ( -- rat )
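One way to picture a run-time action table is as a three-cell record whose address serves as the rat. The sketch below is written under that assumption only; real systems may represent the token differently. The accessors rat>int, rat>comp, and rat>post, the actions cell-int, cell-comp, and cell-post, the table rat-cell, and the dispatcher use-rat are all hypothetical names that appear neither in the proposal nor in FORTH, Inc.'s write-up.

```forth
\ Hypothetical sketch: a rat is the address of a three-cell table
\ laid out as: interpreting xt, compiling xt, postponing xt.
: make-rat ( xt-int xt-comp xt-post "name" -- )
    create  rot , swap , , ;

: rat>int  ( rat -- xt )  @ ;
: rat>comp ( rat -- xt )  cell+ @ ;
: rat>post ( rat -- xt )  2 cells + @ ;

\ Example actions for a single-cell literal (names are assumptions).
: cell-int  ( x -- x ) ;                      \ interpreting: leave x on the stack
: cell-comp ( x -- )  postpone literal ;      \ compiling: compile x as a literal
: cell-post ( x -- )                          \ postponing: compile code that
    cell-comp  ['] cell-comp compile, ;       \ will later compile x

' cell-int ' cell-comp ' cell-post make-rat rat-cell

\ What a text interpreter might do with a recognizer result.
: use-rat ( i*x rat -- j*x )
    state @ if  rat>comp  else  rat>int  then  execute ;
```

Here `42 rat-cell use-rat` leaves 42 on the stack when interpreting and compiles it as a literal when compiling. A postpone-like word would use rat>post instead of consulting STATE.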

pattern parameters: the optional set of data fetched after a successful text token match. The set is on various stacks below the run-time action table. (This could use a better name, not sure if it's really needed, but it helped my thinking.)

I walk through the examples below with the notes above.

Example: REC-NAME

FORTH, Inc. has this example:

' EXECUTE ' COMPILE, ' POSTPONE, TRANSLATE: TRANSLATE-WORD
' EXECUTE ' EXECUTE  ' COMPILE,  TRANSLATE: TRANSLATE-IMM
: REC-NAME ( c-addr len -- xt addr1 | addr2 )
    (FIND) CASE
        -1 OF  TRANSLATE-WORD  ENDOF
        1 OF  TRANSLATE-IMM  ENDOF
        0 OF  TRANSLATE-NONE  ENDOF
    ENDCASE ;

Compared to the steps above:

  1. Data to be handled is "words in general".
  2. The pattern is a word that is in the dictionary.
  3. The pattern parameters fetched could be:
    • xt 1
    • xt -1
    • c-addr 0
  4. Pattern parameters are associated to rats as follows:
    • -1 to TRANSLATE-WORD
    • 1 to TRANSLATE-IMM
    • 0 to TRANSLATE-NONE (originally, NOTFOUND).

(FIND) completes both Steps 2 & 3. The rat output is based on the pattern parameters fetched, not the pattern being matched.

Example: REC-TICK

From the proposal:

: rec-tick ( addr u -- translation ) \ gforth-experimental
    over c@ '`' = if
        1 /string find-name dup if
            name>interpret translate-cell exit then
        drop translate-none exit then
    rec-none ;

Walking through the steps:

  1. The data to be handled is a ticked word.
  2. The pattern is a backquote (`) followed by a name in the dictionary.
  3. The pattern parameters fetched by 1 /string find-name could be:
    • nt
    • 0
  4. Pattern parameters are associated to rats as follows:
    • nt to translate-cell
    • 0 to translate-none

The rat output is based on the pattern parameters fetched, not the pattern being matched. 2drop translate-none seems clearer than rec-none. I keep getting caught looking at the rec-tick example thinking "what is rec-none recognizing?"

Example Observations

  • Pattern matching and pattern parameter fetching can be combined or separate words.
  • It would be reasonable to have a failed pattern parameter fetch be an error. A pitfall of creating recognizers is ensuring there is little to no overlap of patterns.
    • E.g. 'bob is the name as defined, processed by rec-name. 'stan is the ticked version of stan, processed by rec-tick.
  • rec-none could be the final recognizer in recognizer sequences, exiting any further evaluation. Instead of creating a new sequence, one could move rec-none earlier in the sequence.

Thank you for reading this far; hopefully there is more food for thought than madness.

ruv

@Josef wrote:

I agree with @ruv that "translation" doesn't quite fit and finding suitable terms is a real challenge.

Because a "translation token" is a table of run-time actions, "run-time action table" (rat) would seem appropriate, explained below.

I'm making one more attempt on this matter.

The language of the Standard already uses concepts such as data object, data type, typed data object, and subtyping (see 3.1 Data types).

Using these concepts, we can describe a successful recognition result as a pair consisting of a data object and its corresponding data type.

On the stack, data types must be represented by specific identifiers, similar to how semantics elements are represented by xt identifiers. We might refer to such an identifier as a type descriptor (symbol td).

  • Note: "type descriptor" is preferred over "type identifier" because, in the language of the Standard, we will need expressions like "type descriptor td identifies ...". Using "type identifier" would lead to awkward repetitions such as "type identifier ti identifies ...".
  • Another option for this term could be "type token" (seems less preferable).

Additionally, we might define a qualified data object (symbol qdo) as a pair consisting of a data object and the type descriptor that identifies that object's data type.

  • Note: This concept should be distinguished from the existing concept of a "typed data object".

The elegance and strength of this approach lie in the following points:

  • It builds upon existing terminology, with only slight extensions.
  • It incorporates existing data type symbols into naming conventions.
  • It leverages subtyping relationships between data types to reduce redundancy (adhering to the DRY principle).

Type descriptors can be used to:

  • Translate data objects (into the body of a Forth definition when compiling or side effects when interpreting).
  • Convert data objects to different data types (casting).
    • E.g., getting xt from nt (for example, of an ordinary word only)
  • Check subtyping relationships between data types (or of a qualified data object).
  • Define new type descriptors.

These features can be designed independently of recognizers, and recognizers only rely on them when returning a qualified data object or analyzing a qualified data object from another recognizer.
