Proposal: Nestable Recognizer Sequences

Informal

AntonErtl (Proposal, 2020-08-22 16:09:52)

Nestable Recognizer Sequences

Author

M. Anton Ertl

Problem

There are similarities between a word list, a recognizer, a search order, and a recognizer sequence: all of them take a string as input and either recognize it or not. If they recognize it, word lists and the search order produce a name token (or an xt and an immediate flag), while recognizers and recognizer sequences produce some data and a rectype.

The similarity between wordlists and a search order has inspired the idea of nestable search orders: Several wordlists could be combined into a sequence that itself would work like a wordlist in other search orders. However, the search order words had already been standardized, so this idea never made it out of the concept stage.

The similarity between the search order and recognizer sequences has led to the present recognizer proposal containing the words GET-RECOGNIZER and SET-RECOGNIZER, which are mostly modeled on GET-ORDER and SET-ORDER.

As an alternative, this proposal introduces nestable (but not necessarily changeable) recognizer sequences.

Solution

Add the following words:

rec-sequence ( xt1 .. xtn n "name" -- )

Defines a recognizer "name".

"name" execution: ( c-addr u -- ... rectype )

Tries to recognize c-addr u using the recognizers xtn...xt1 (in this order). The first successful recognizer in the sequence returns from "name" with its result. If no recognizer succeeds, return RECTYPE-NULL.

[On the order of xts: This order is modeled on the order in the search order, but one could use the reverse order without suffering disadvantages; I am leaving this open to bikeshedding discussion.]
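The behaviour described above could be sketched in mostly standard Forth as follows. This is an illustration, not the reference implementation. It assumes that every recognizer consumes the string, i.e. has the stack effect ( c-addr u -- ... rectype | rectype-null ). RECTYPE-NULL is defined here only as a stand-in; TRY-RECS, RECTYPE-DEMO and the toy recognizers REC-A, REC-B and REC-C are hypothetical names used for the example.

```forth
create rectype-null   \ stand-in; a recognizer system supplies its own
create rectype-demo   \ hypothetical rectype returned by the toy recognizers

\ Hypothetical helper: try n recognizers stored from a-addr onwards.
: try-recs ( c-addr u a-addr n -- ... rectype )
  dup 0= if 2drop 2drop rectype-null exit then
  2over 2over 2>r 2>r          \ R: a-addr n c-addr u
  drop @ execute               \ run the next recognizer on c-addr u
  dup rectype-null <> if       \ success: discard the saved copies
    2r> 2drop 2r> 2drop exit
  then
  drop 2r> 2r>                 \ failure: c-addr u a-addr n again
  1- swap cell+ swap recurse ; \ continue with the remaining recognizers

: rec-sequence ( xt1 .. xtn n "name" -- )
  create dup , 0 ?do , loop    \ body: n, xtn, ..., xt1
does> ( c-addr u body -- ... rectype )
  dup cell+ swap @ try-recs ;

\ Toy recognizers, each accepting exactly one one-character string.
: rec-a ( c-addr u -- 1 rectype-demo | rectype-null )
  s" a" compare 0= if 1 rectype-demo else rectype-null then ;
: rec-b ( c-addr u -- 2 rectype-demo | rectype-null )
  s" b" compare 0= if 2 rectype-demo else rectype-null then ;
: rec-c ( c-addr u -- 3 rectype-demo | rectype-null )
  s" c" compare 0= if 3 rectype-demo else rectype-null then ;

' rec-b ' rec-a  2 rec-sequence rec-ab    \ tries REC-A, then REC-B
' rec-c ' rec-ab 2 rec-sequence rec-abc   \ nests REC-AB, then tries REC-C
```

Because the xts are stored in the order they come off the stack, the body already holds them in the order xtn...xt1, so the lookup simply walks the body front to back. Reversing the stack order would only change the storing loop.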

get-rec-sequence ( xt -- xt1 .. xtn n )

If xt refers to a recognizer sequence, return the contained recognizers. If xt refers to a deferred word, perform DEFER@ followed by GET-REC-SEQUENCE (i.e., GET-REC-SEQUENCE works through deferred words). If xt refers to neither, return 0.

FORTH-RECOGNIZER now contains the xt of a recognizer or a rec-sequence. RECOGNIZE is unnecessary, because its functionality is performed by running a rec-sequence. GET-RECOGNIZER, SET-RECOGNIZER, NEW-RECOGNIZER-SEQUENCE are replaced by the words above.

Typical Use

Define a recognizer sequence for the classical text interpreter:

' rec-num ' rec-nt 2 rec-sequence rec-forth-cm ( c-addr u -- ... rectype )

Extend it with FP numbers:

' rec-float ' rec-forth-cm 2 rec-sequence rec-forth ( c-addr u -- ... rectype )

Make this the text interpreter:

' rec-forth to forth-recognizer

Have a dot-parser to be searched first:

' rec-forth ' rec-dot 2 rec-sequence rec-.forth ( c-addr u -- ... rectype )

Put a user-defined recognizer REC-USER behind the currently active recognizers, temporarily:

' rec-user forth-recognizer 2 rec-sequence rec-forthuser
forth-recognizer ( old )
' rec-forthuser to forth-recognizer
\ some code that uses REC-USER:
...
\ now restore the old recognizer sequence
( old ) to forth-recognizer

You can insert a recognizer in the middle of a sequence by picking the existing sequence apart and using it for constructing a new recognizer:

' rec-forth-cm get-rec-sequence swap ' rec-foo swap rot 1+ rec-sequence rec-FOOrth-cm

This inserts REC-FOO to be searched as second recognizer (after REC-NT). This approach has the disadvantage that you need to know pretty well what the recognizer currently contains (it shares this disadvantage with the GET-RECOGNIZER interface). It also has the disadvantage that you have no easy way to update all the recognizer sequences that contain REC-FORTH-CM. To avoid these disadvantages, you can put deferred words into recognizer sequences from the start:

: rec-nothing ( c-addr u -- rectype-null )
  2drop rectype-null ;

defer rec-foo-deferred ' rec-nothing is rec-foo-deferred

' rec-num ' rec-foo-deferred ' rec-nt 3 rec-sequence rec-forth-cm

Then you can plug in REC-FOO:

' rec-foo is rec-foo-deferred

You can deactivate it again later by plugging REC-NOTHING back in. This approach works only if you have the foresight to insert REC-FOO-DEFERRED from the start, or if you can change the source code of REC-FORTH-CM later.

An alternative would be to be able to change the rec-sequences in words defined with REC-SEQUENCE; for that we would need something like SET-REC-SEQUENCE. It's not clear to me that this is really needed, though.

Proposal

TBD (if this informal proposal is actually popular enough to merit further development).

Existing practice

A word REC-SEQUENCE: (but without GET-REC-SEQUENCE) has been in Gforth since 2016. It has not been used; instead, the mainstream GET-RECOGNIZER SET-RECOGNIZER interface was used.

Reference Implementation

TBD

Testing

TBD

Credits

Ruvim has recently suggested something in this vein, rekindling my interest in this kind of interface.

AntonErtl

This proposal is intended to modify the Recognizer proposal.

ruv

I strongly support this approach: to switch the recognizer that the Forth text interpreter uses, we pass the xt of another recognizer to the system.

But, as I said before, it's better to have the separate getter and setter instead of the single value that is changed via TO.

GET-REC-SEQUENCE and Co. can form a totally separate proposal. It is worth extracting them into a separate proposal to make the basic proposal smaller and to reduce the number of conflicts.

UlrichHoffmann

Some remarks:

  • Is it a good idea to have defining words such as rec-sequence and id-creating words such as wordlist both in the standard? Seems to be inconsistent to me.

  • Isn't the Forth way to chain execution tokens to put them in colon definitions? Why do we need rec-sequence?

ruv

Isn't the Forth way to chain execution tokens to put them in colon definitions? Why do we need rec-sequence?

A recognizer can be always created as a colon definition (or a noname, or a quotation).

My view is that rec-sequence is an optional helper that can be implemented in a portable way and may be provided by a system. This helper is useful when the new recognizer is just a composition of several other recognizers. Certainly, it can be standardized as an optional word (a kind of standard library level).

ruv

get-rec-sequence ( xt -- xt1 .. xtn n )
If xt refers to a recognizer sequence, return the contained recognizers. If xt refers to a deferred word, perform DEFER@ followed by GET-REC-SEQUENCE (i.e., GET-REC-SEQUENCE works through deferred words).
IF xt refers to neither, return 0.

If recognizer sequences are immutable, a recognizer that is not a sequence can be viewed as a sequence with a single element. I.e., get-rec-sequence can have effect ( xt -- xt 1 ) for such recognizer. Can it be useful?

AntonErtl

We have defining words and id-creating words in the standard already. Here I proposed a defining word because it's useful to have a name for the sequence, for building the next sequence. Also, named entities are useful for introspection features like ORDER. The namelessness of WORDLIST is useful when you want to use it as a building stone for, e.g., VOCABULARY. Gforth has NONAME, so in such a system you can create nameless recognizer sequences.

The interface of recognizers means that you have to perform some checking after every recognizer has been called (if successful, then return from the sequence). That's repetitive to write as a colon definition.
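To illustrate the repetition, here is roughly what a hand-written colon definition for three recognizers might look like. This is a sketch under the assumption that recognizers consume the string, ( c-addr u -- ... rectype | rectype-null ). RECTYPE-NULL is a stand-in defined only for the example, and RECTYPE-DEMO, REC-A, REC-B and REC-C are hypothetical toy names.

```forth
create rectype-null   \ stand-in; a recognizer system supplies its own
create rectype-demo   \ hypothetical rectype returned by the toy recognizers

: rec-a ( c-addr u -- 1 rectype-demo | rectype-null )
  s" a" compare 0= if 1 rectype-demo else rectype-null then ;
: rec-b ( c-addr u -- 2 rectype-demo | rectype-null )
  s" b" compare 0= if 2 rectype-demo else rectype-null then ;
: rec-c ( c-addr u -- 3 rectype-demo | rectype-null )
  s" c" compare 0= if 3 rectype-demo else rectype-null then ;

\ The same check-and-return pattern has to be spelled out after each call.
: rec-abc ( c-addr u -- ... rectype )
  2dup 2>r rec-a dup rectype-null <> if 2r> 2drop exit then drop
  2r@      rec-b dup rectype-null <> if 2r> 2drop exit then drop
  2r>      rec-c ;
```

Each additional recognizer adds another line of the same shape, which is the boilerplate a defining word like REC-SEQUENCE factors out.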

The result of GET-REC-SEQUENCE for an xt that is neither a rec-sequence nor a deferred word can be either way, and you can derive the other from the one you have. If you want to enumerate all the base recognizers (e.g., for printing), I find the proposed version slightly easier. It's also less misleading when the passed xt is not actually a recognizer.

AntonErtl

Recognizer sequence building with a binary constructor

The proposal above allows you to build rec-sequences with n recognizers in order to support implementing interfaces like GET-RECOGNIZER SET-RECOGNIZER based on it. An alternative is to allow exactly two recognizers in a sequence, and build bigger sequences as a (possibly degenerate) tree of such sequence nodes. In many use cases you combine only two recognizers into a rec-sequence at one time, anyway (see the typical use above).

Whether the binary constructor should produce a named or unnamed sequence is not clear to me. I'll use named sequences in the rest of this posting. Whether to pass the first recognizer on top or bottom is also unclear (top in the following). So we define

: two-recognizers ( xt1 xt2 "name" -- )
  create , ,
does> ( c-addr u body -- ... rectype )
  dup >r @ execute dup rectype-null <> if
    r> drop exit then
  drop r> cell+ @ execute ;

Typical Use

' rec-num ' rec-nt two-recognizers rec-forth-cm ( c-addr u -- ... rectype )
' rec-float ' rec-forth-cm two-recognizers rec-forth ( c-addr u -- ... rectype )
' rec-forth ' rec-dot two-recognizers rec-.forth ( c-addr u -- ... rectype )
' rec-user forth-recognizer two-recognizers rec-forthuser
forth-recognizer ( old )
' rec-forthuser to forth-recognizer
\ some code that uses REC-USER:
...
\ now restore the old recognizer sequence
( old ) to forth-recognizer

Discussion

TWO-RECOGNIZERS (as well as corresponding getter and setter words) is much shorter to implement than REC-SEQUENCE. The downside is that it cannot be used to implement GET-RECOGNIZER and SET-RECOGNIZER.

ruv

Binary constructor

: two-recognizers ( xt1 xt2 "name" -- )
  create , ,
does> ( c-addr u body -- ... rectype )
  dup >r @ execute dup rectype-null <> if
    r> drop exit then
  drop r> cell+ @ execute ;

This constructor expects that a recognizer doesn't consume ( c-addr u ) on rejection.

Otherwise (if a recognizer consumes ( c-addr u) in any case) the definition will be a bit more complex:

: two-recognizers ( xt1 xt2 "name" -- )
    create , ,
  does> ( c-addr u  a-addr-body )
    dup >r -rot 2dup 2>r rot
    @ execute dup rectype-null <> if
      rdrop rdrop rdrop exit
    then drop
    2r> r> cell+ @ execute
;

Nevertheless, I'm inclined to agree that if a recognizer consumes ( c-addr u ) in any case, the overall code seems to become shorter.

Whether to pass the first recognizer on top or bottom is also unclear

It is clearer if they are passed left to right, i.e., we place them on the stack in the same order in which they should be executed: the first placed is executed first, the second placed is executed second (if at all), and the last placed (the topmost) is executed last.

This situation is similar to the order of local variables (in declaration): direct mapping is more clear.
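A left-to-right variant of the binary constructor might look like the following sketch. It assumes the convention that a recognizer always consumes ( c-addr u ). REC-PAIR is a hypothetical name chosen to avoid clashing with TWO-RECOGNIZERS, RECTYPE-NULL is a stand-in, and RECTYPE-DEMO, REC-A and REC-ANY are toy definitions for the example only.

```forth
create rectype-null   \ stand-in; a recognizer system supplies its own
create rectype-demo   \ hypothetical rectype returned by the toy recognizers

: rec-pair ( xt-first xt-second "name" -- )
  create swap , ,              \ body: xt-first, xt-second
does> ( c-addr u body -- ... rectype )
  dup >r -rot 2dup 2>r rot     \ R: body c-addr u
  @ execute                    \ try xt-first
  dup rectype-null <> if
    2r> 2drop r> drop exit     \ success: discard the saved copies
  then drop
  2r> r> cell+ @ execute ;     \ otherwise try xt-second

\ Toy recognizers: REC-A accepts only "a", REC-ANY accepts everything.
: rec-a ( c-addr u -- 1 rectype-demo | rectype-null )
  s" a" compare 0= if 1 rectype-demo else rectype-null then ;
: rec-any ( c-addr u -- 0 rectype-demo )
  2drop 0 rectype-demo ;

' rec-a   ' rec-any rec-pair rec-a-first     \ REC-A, then REC-ANY
' rec-any ' rec-a   rec-pair rec-any-first   \ REC-ANY shadows REC-A
```

With this order the source line reads in execution order, so the string "a" yields 1 from REC-A-FIRST but 0 from REC-ANY-FIRST.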

JennyBrien

The similarity between wordlists and a search order has inspired the idea of nestable search orders: Several wordlists could be combined into a sequence that itself would work like a wordlist in other search orders. However, the search order words had already been standardized, so this idea never made it out of the concept stage.

The similarity between the search order and recognizer sequences has led to the present recognizer proposal containing the words GET-RECOGNIZER and SET-RECOGNIZER, which are mostly modeled on GET-ORDER and SET-ORDER.

At first glance, it's simple to convert a wordlist into a recognizer, so recognizer sequences would also give nestable search orders. If WORDLIST returned the xt of an anonymous recognizer... but there would still be problems deciding how to SET-CURRENT. There would still have to be a difference between recognizers that search the dictionary (called by REC-NAME or similar) and other recognizers, otherwise there can be no concept of a 'current search order'.

So, do we need a FORTH-RECOGNIZER that combines the two? Is it sufficient to replace the 'word-not-found' portion of the interpreter? So far, I have only seen one use-case for a user-written recognizer to precede REC-NAME and I suspect such users would be better served by having their own interpreter loop rather than patching into the system one. Maybe all that is needed is the ability to add a recognizer to the current stack and leave it there until it is removed by MARKER or the stack is reset by QUIT, in which case:

   : +RECOGNIZER (  _name_ -- )  ' action-of recognized two-recognizers ;

ruv

Is it sufficient to replace the 'word-not-found' portion of the interpreter?

I think, no.

Some system may have a word 'X. If a program has a word X and a recognizer for '<ccc>, the phrase 'X in this program will be translated incorrectly when the recognizer doesn't precede "REC-NAME".

So, a program should have the ability to override any of the system's recognizers.

Also, as I wrote before (news:rduhlf$hor$1@dont-email.me), it can be useful to reuse the system's interpreter loop, since otherwise too many words would have to be re-implemented in some cases.

Maybe all that is needed is the ability to add a recognizer to the current stack and leave it there until it is removed by MARKER or the stack is reset by QUIT

For libraries (independent modules), it's critical to have the ability to restore the system's recognizer.

JennyBrien

Some system may have a word 'X. If a program has a word X and a recognizer for '<ccc>, the phrase 'X in this program will be translated incorrectly when the recognizer doesn't precede "REC-NAME".

This runs counter to the user's expectation that they can name a definition anything printable and have it recognized.

So, a program should have the ability to override any of the system's recognizers.

Which may raise more theoretical questions such as whether or not FIND can find locals :)

For libraries (independent modules), it's critical to have the ability to restore the system's recognizer.

As I see it there are two possible kinds of module:

  1. Included modules that search the CURRENT wordlist and add their definitions to it. They do not alter recognizers.
  2. Required modules that create their definitions on their own wordlist and add it to the search order. They may also set recognizers.

That's something we need to discuss elsewhere.

The effects of a Required module are local to the module that Requires it.

ruv

This runs counter to the user's expectation that they can name a definition anything printable and have it recognized.

Don't confuse a system (and its user) with a program (and its user). You talk about a system's user. I talk about a program (and perhaps a user of the program).

A standard system is not allowed to recognize '<ccc> before any word. At the same time, the system is allowed to provide the word 'FOO in the FORTH-WORDLIST.

A standard program is allowed to configure the Forth text interpreter to properly translate the source code of this program (or a DSL from a user of this program).

So, the program may have the word FOO and may configure a recognizer for '<ccc>. If this recognizer takes control after the recognizer for Forth words, 'FOO will be resolved incorrectly in a system that provides the word 'FOO. NB: the program knows nothing about the 'FOO word since it's a standard program, and the standard doesn't specify such a word.

Actually, this problem existed before recognizers too. A system may provide a word FED. A program may use hexadecimal number FED (i.e. when BASE is 16). And this standard program will be translated incorrectly in a standard system that provides a word FED, but correctly in other systems.
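The FED case can be reproduced with a few lines. This is a sketch that assumes the host system does not already define a word FED. The colon definition stands in for a system-provided word, and FED-BEFORE and FED-AFTER are hypothetical names used only to capture the two results.

```forth
hex FED decimal constant fed-before   \ no word FED yet: parsed as a hex number

: FED ( -- n ) 42 ;                   \ stands in for a system-provided word

hex FED decimal constant fed-after    \ now the dictionary search wins
```

Here FED-BEFORE holds 4077 (hexadecimal FED), while FED-AFTER holds 42, because the text interpreter looks up the name before attempting number conversion.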

I think, for words it can be solved (independently of recognizers) by a kind of declaration that a program requires the standard environment. As a variant, FORTH-WORDLIST shall contain only standard words, and SYSTEM-WORDLIST may contain all other words.
