Proposal: Recognizer

Informal

This page is dedicated to discussing this specific proposal

BerndPaysan: Recognizer (Proposal), 2020-07-20 20:36:30

Forth Recognizer -- Request For Discussion

  • Author: Matthias Trute
  • Version: 4
  • Date: 2 August 2018
  • Status: Final (Committee Supported Proposal)

Change history

  1. 2014-10-03 Version 1 - initial version.
  2. 2015-05-17 Version 2 - extend rationale, added ' and [']
  3. 2015-12-01 Version 3 - separate use cases, minor changes for nested recognizer stacks. New POSTPONE action.
  4. 2018-07-24 Version 4 - Clarifications, Fixing typos, added test cases

Change history, details

  1. 2016-09-18 Added more test cases
  2. 2016-09-25 Clarify that >IN is unchanged for an REC-FAIL (RECTYPE-NULL) result.
  3. 2016-10-21 simpler reference implementation
  4. 2016-11-05 first attempt to rename keywords and concept names
  5. 2017-05-15 discussion of LOCATE
  6. 2017-08-08 move example recognizers to discussion/rationale section.
  7. 2017-09-12 renamed keywords in XY.6.1 as suggested by the Forth 200x committee
  8. 2017-12-06 changed wording from "recognizer stack" to "recognizer sequence".
  9. 2017-12-10 created Recognizer EXT section with recognizer sequence management words.
  10. 2018-04-09 expanded EXT section with RECTYPE* words
  11. 2018-05-11 add comments about recognizable?
  12. 2018-07-23 finalized
  13. 2018-07-24 small bugfixes
  14. 2018-08-02 split document into proposal and comments

Problem

The Forth compiler can be extended easily. The Forth interpreter however has a fixed set of capabilities as outlined in section 3.4 of the standard text: Words from the dictionary and some number formats.

It's not possible to use the Forth text interpreter in an application or system extension context. Most interpreters in existing systems use a number of hooks to extend the interpreter. That makes it possible to use a loadable library to implement new data types to be handled like the built-in ones. Floating point numbers are an example: they have their own parsing and data handling words, including a stack of their own.

Furthermore applications need to use system provided and system specific words or have to re-invent the wheel to get numbers with a sign or hex numbers with the $ prefix. The building blocks (FIND, COMPILE,, >NUMBER etc) are available but there is a gap between them and what the Forth interpreter already does.

To actually handle data in the Forth context, the processing actions need to be STATE aware. It would be nice if the Forth text interpreter, that maintains STATE, is able to do the data processing without exposing STATE to the data handling methods. These different methods need to be registered somehow.

Solution

The monolithic design of the Forth interpreter is factored into three major blocks. First, the interpreter: it maintains STATE and organizes the work. Second, the actual data parsing: it is called from the interpreter and analyses strings (sub-strings of SOURCE) to check whether they match the criteria for a certain data type. These parsing words are grouped to achieve an order of invocation. Third, the data specific handling methods: the result of a parsing word is handed over to the interpreter together with these methods. There are three methods for each data type: two that are selected by STATE (interpret and compile) and one to POSTPONE the data.

The combination of a parsing word and the set of data handling words to deal with the data is called a recognizer. There is no strict 1:1 relation between the parsing words and the data handling sets. A data handling set for e.g. single cell numbers can be used by different parsing words.

Whenever the Forth text interpreter is mentioned, the standard words EVALUATE (CORE), ' (tick, CORE), INCLUDE-FILE (FILE), INCLUDED (FILE), LOAD (BLOCK) and THRU (BLOCK) are expected to act likewise. This proposal does not change these words, but it provides the tools to do so. As long as the standard feature set is used, a complete replacement with recognizers is possible.

This proposal is about the building blocks.

Proposal

XY. The optional Recognizer word set

XY.1 Introduction

The recognizer concept consists of two elements: parsing words, and the data type information they return. The data type information identifies the parsed data and provides methods to perform the various semantics of the data: interpret, compile and postpone. A parsing word can return different data type information. A particular data type information can be used by different parsing words.

A system provided data type information is called RECTYPE-NULL. It is used if no other one is applicable. This token is associated with the system error actions if used in step e) of the text interpreter (see Appendix). It is used to achieve the action d) of the section 3.4 text interpreter.

A recognizing word within the recognizer concept has the stack effect

REC-SOMETYPE ( addr len -- i*x RECTYPE-SOMETYPE | RECTYPE-NULL )

This recognizing word must not change the string. When it is called from the interpreter, it may access SOURCE and, if applicable, even change >IN. If >IN is not used, any string may serve as input, otherwise "addr/len" is assumed to be a substring of the buffer SOURCE.

"i*x" is the result of the recognizing action of the string "addr/len". RECTYPE-SOMETYPE is the data type id that the interpreter uses to execute the interpret, compile or postpone actions for the data "i*x".

All three actions are called with the "i*x" data as left from the recognizing word and are generally expected to consume it. They can have additional stack effects, depending on what RECTYPE-SOMETYPE-METHOD actually does.

RECTYPE-SOMETYPE-METHOD ( ... i*x -- j*y )

The data "i*x" doesn't have to be on the data stack, it can be at different places, if applicable. E.g. floating point numbers have a stack of their own. In this case, the data stack contains the RECTYPE-SOMETYPE information only.

XY.2 Additional terms and notations

Data type id A cell sized number. It identifies the data type and a method set to perform the data processing in the text interpreter. The actual numeric value is system specific.

Recognizer A string parsing word that returns a data type id together with the parsed data if successful. The string parsing word is assumed to run within the Forth interpreter and can access SOURCE and >IN.

Recognizer Sequence An ordered set of recognizers. It is identified with a cell sized numeric id.

XY.3 Additional usage requirements

XY.3.1 Data type id

A data type id is a single cell value that identifies a certain data type. Append the following table to table 3.1:

| Symbol | Data type    | Size on stack |
| ------ | ------------ | ------------- |
| dt     | data type id | 1 cell        |

XY.4 Additional documentation requirements

XY.4.1 System documentation

XY.4.1.1 Implementation-defined options

No additional options.

XY.4.1.2 Ambiguous conditions
  • Change of the content of the parsed string during parsing.

XY.4.2 Program documentation

No additional dependencies.

XY.5 Compliance and labeling

The phrase "Providing the Recognizer word set" shall be appended to the label of any standard system that provides all of the Recognizer word set.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( -- rec-seq-id ) RECOGNIZER
A system VALUE with a recognizer sequence id.

It is a VALUE that can be changed with TO to assign a new recognizer set. This change has immediate effect.

This recognizer set shall be used in all system level words like EVALUATE, LOAD etc.
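
Because FORTH-RECOGNIZER is a VALUE, a program can switch sequences temporarily. The following is a minimal sketch of that idea and not part of the proposal: `with-recognizers` is a hypothetical name, and the sketch assumes that the Exception word set (CATCH and THROW) is available.

```forth
\ Sketch only: assumes FORTH-RECOGNIZER is a VALUE as specified above and
\ that CATCH and THROW are present. with-recognizers is a hypothetical name.
: with-recognizers ( i*x xt rec-seq-id -- j*y )
    FORTH-RECOGNIZER >R            \ remember the current sequence
    TO FORTH-RECOGNIZER            \ the change takes effect immediately
    CATCH                          \ run xt under the new sequence
    R> TO FORTH-RECOGNIZER         \ restore the old sequence, even after an error
    THROW ;                        \ re-raise the error, if there was one
```

The word passed as xt sees the replacement sequence through FORTH-RECOGNIZER; once it returns or throws, the previous sequence is active again.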

RECOGNIZE ( addr len rec-seq-id -- i*x RECTYPE-DATATYPE | RECTYPE-NULL ) RECOGNIZER
Apply the string at "addr/len" to the elements of the recognizer set identified by rec-seq-id. Terminate the iteration if either a parsing word returns a data type id that is different from RECTYPE-NULL or the set is exhausted. In the latter case return RECTYPE-NULL.

"i*x" is the result of the parsing word. It represents the data from the string. It may be on other locations than the data stack. In this case the stack diagram should be read accordingly.

RECTYPE>COMP ( RECTYPE-DATATYPE -- XT-COMPILE ) RECOGNIZER
Return the execution token for the compilation action from the recognizer data type id.

RECTYPE>INT ( RECTYPE-DATATYPE -- XT-INTERPRET ) RECOGNIZER
Return the execution token for the interpretation action from the recognizer data type id.

RECTYPE>POST ( RECTYPE-DATATYPE -- XT-POSTPONE ) RECOGNIZER
Return the execution token for the postpone action from the recognizer data type id.

RECTYPE-NULL ( -- RECTYPE-NULL ) RECOGNIZER
The null data type id. It is to be used if no other data type id is applicable but one is needed. Its associated methods perform system specific error actions. The actual numeric value is system dependent.

RECTYPE: ( XT-INTERPRET XT-COMPILE XT-POSTPONE "name" -- ) RECOGNIZER
Skip leading space delimiters. Parse name delimited by a space. Create a data type id under the name name and associate the three execution tokens.

The words for XT-INTERPRET, XT-COMPILE and XT-POSTPONE are called with the parsed data i*x that e.g. RECOGNIZE has returned.

The word behind XT-INTERPRET shall have the stack effect ( ... i*x -- j*y ). The words behind XT-COMPILE and XT-POSTPONE shall consume i*x.

The execution time of name leaves a cell sized token on the data stack that can be applied to the RECTYPE>* words.

XY.6.2 Recognizer Extension Words

A Forth system that uses recognizers in the core has words for numbers and dictionary look-ups. They shall be named as shown in the table:

| Name        | Stack effect                                                      |
| ----------- | ----------------------------------------------------------------- |
| `REC-NUM`   | `( addr len -- n RECTYPE-NUM \| d RECTYPE-DNUM \| RECTYPE-NULL )` |
| `REC-FLOAT` | `( addr len -- RECTYPE-FLOAT \| RECTYPE-NULL ) (F: -- f \| )`     |
| `REC-FIND`  | `( addr len -- XT +/-1 RECTYPE-XT \| RECTYPE-NULL )`              |
| `REC-NT`    | `( addr len -- NT RECTYPE-NT \| RECTYPE-NULL )`                   |

The recognizer type names, if available, shall be as shown in the table below:

| Name            | Stack items                         | Comment                                   |
| --------------- | ----------------------------------- | ----------------------------------------- |
| `RECTYPE-NUM`   | `( -- n RECTYPE-NUM)`               | single cell number                        |
| `RECTYPE-DNUM`  | `( -- d RECTYPE-DNUM)`              | double cell number                        |
| `RECTYPE-FLOAT` | `( -- RECTYPE-FLOAT)` `(F: -- f )`  | floating point number                     |
| `RECTYPE-XT`    | `( -- XT +/-1 RECTYPE-XT)`          | word from the dictionary matching `FIND`  |
| `RECTYPE-NT`    | `( -- NT RECTYPE-NT)`               | word from the dictionary with name token NT |

The following words deal with changing and creating recognizer sequences.

GET-RECOGNIZER ( rec-seq-id -- rec-n .. rec-1 n ) RECOGNIZER EXT
Copy the recognizer sequence rec-1 .. rec-n to the data stack. The element rec-1 is the first in the sequence.

The source is unchanged.

SET-RECOGNIZER ( rec-n .. rec-1 n rec-seq-id -- ) RECOGNIZER EXT
Replace the recognizer sequence identified by `rec-seq-id` with a new set of `n` recognizers `rec-x`.

If the capacity of the destination sequence is too small to hold all new elements, an ambiguous condition exists.

NEW-RECOGNIZER-SEQUENCE ( size -- rec-seq-id ) RECOGNIZER EXT
Create a new, empty recognizer sequence with room for at least `size` elements.
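
The stack pictures of GET-RECOGNIZER and SET-RECOGNIZER are designed to match, so the output of one can feed the other. A minimal sketch, assuming a system that provides these words as specified here; `copy-recognizers` and `recognizer-count` are hypothetical helper names and not part of the proposal.

```forth
\ Sketch only: assumes GET-RECOGNIZER and SET-RECOGNIZER as specified above.
\ Both helper names are hypothetical.

\ Copy the contents of one sequence into another. The destination needs
\ enough capacity, otherwise the ambiguous condition noted above applies.
: copy-recognizers ( src-id dst-id -- )
    >R GET-RECOGNIZER R> SET-RECOGNIZER ;

\ Number of recognizers currently held by a sequence.
: recognizer-count ( rec-seq-id -- n )
    GET-RECOGNIZER DUP >R          \ keep n for the result
    0 ?DO DROP LOOP                \ discard rec-1 .. rec-n
    R> ;
```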

XY.7 Reference Implementation

Basic recognizer sequence module. It is implemented as a separate stack.

: STACK ( size -- stack-id )
    1+ ( size ) CELLS HERE SWAP ALLOT
    0 OVER ! \ empty stack
;

: SET-STACK ( item-n .. item-1 n stack-id -- )
  2DUP ! CELL+ SWAP CELLS BOUNDS
  ?DO I ! CELL +LOOP ;

: GET-STACK ( stack-id -- item-n .. item-1 n )
   DUP @ >R R@ CELLS + R@ BEGIN
     ?DUP
   WHILE
     1- OVER @ ROT CELL - ROT
   REPEAT
   DROP R> ;
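
To make the memory layout of the stack module concrete, here is a short usage sketch. It assumes STACK, SET-STACK and GET-STACK exactly as defined above (which in turn rely on the common but non-standard words BOUNDS and CELL); `demo-stack` is a hypothetical name.

```forth
\ Sketch only: uses the three words defined above; demo-stack is hypothetical.
4 STACK CONSTANT demo-stack        \ room for 4 items, initially empty
30 20 10 3 demo-stack SET-STACK    \ item-3 = 30, item-2 = 20, item-1 = 10

demo-stack @         ( 3 )         \ cell 0 holds the item count
demo-stack CELL+ @   ( 3 10 )      \ item-1 sits in the first slot
2DROP

demo-stack GET-STACK ( 30 20 10 3 ) \ same order as passed to SET-STACK
2DROP 2DROP
```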

The recognizer sequence uses the stack module. Hence the stack-id becomes the rec-seq-id.

: NEW-RECOGNIZER-SEQUENCE STACK ;
: SET-RECOGNIZER SET-STACK ;
: GET-RECOGNIZER GET-STACK ;

\ create the default recognizer sequence
4 NEW-RECOGNIZER-SEQUENCE VALUE FORTH-RECOGNIZER

\ create a simple 3 element structure
: RECTYPE: ( XT-INTERPRET XT-COMPILE XT-POSTPONE "<spaces>name" -- )
   CREATE SWAP ROT , , ,
;

\ decode the data structure created by RECTYPE:
: RECTYPE>POST ( RECTYPE-TOKEN -- XT-POSTPONE ) CELL+ CELL+ @ ;
: RECTYPE>COMP ( RECTYPE-TOKEN -- XT-COMPILE  )       CELL+ @ ;
: RECTYPE>INT  ( RECTYPE-TOKEN -- XT-INTERPRET)             @ ;

\ the null token
:NONAME -1 ABORT" FAILED" ; DUP DUP RECTYPE: RECTYPE-NULL

\ depends on the stack implementation
: RECOGNIZE   ( addr len rec-seq-id -- i*x RECTYPE-SOMETYPE | RECTYPE-NULL )
    DUP >R @
    BEGIN
      DUP
    WHILE
      DUP CELLS R@ + @
      2OVER 2>R SWAP 1- >R
      EXECUTE DUP RECTYPE-NULL <> IF
        2R> 2DROP 2R> 2DROP EXIT
      THEN
      DROP R> 2R> ROT
    REPEAT
    DROP 2DROP R> DROP RECTYPE-NULL
;
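
As a hedged illustration of how the pieces above fit together, the following sketch defines a recognizer for C-style hexadecimal literals such as 0x1F and installs it in a private sequence. The names `hex-digits>u`, `rectype-0x`, `rec-0x` and `app-recs` are hypothetical and not part of the proposal. The sketch assumes the reference implementation above is loaded and that COMPARE and /STRING from the String word set are available.

```forth
\ Sketch only: assumes RECTYPE:, RECTYPE-NULL, NEW-RECOGNIZER-SEQUENCE,
\ SET-RECOGNIZER and RECOGNIZE from the reference implementation above.
\ All lower-case names are hypothetical.

: hex-digits>u ( addr len -- u true | false )
    DUP 0= IF 2DROP FALSE EXIT THEN      \ reject an empty digit string
    BASE @ >R HEX
    0 0 2SWAP >NUMBER                    ( ud addr' len' )
    R> BASE !                            \ BASE is restored in every case
    NIP IF 2DROP FALSE EXIT THEN         \ unconverted characters remain
    DROP TRUE ;                          \ keep the low cell of ud

\ interpret: the number is already on the stack.
\ compile and postpone: lay the number down as a literal.
:NONAME ;  :NONAME POSTPONE LITERAL ;  DUP  RECTYPE: rectype-0x

: rec-0x ( addr len -- u rectype-0x | RECTYPE-NULL )
    DUP 3 < IF 2DROP RECTYPE-NULL EXIT THEN              \ too short for 0x plus a digit
    OVER 2 S" 0x" COMPARE IF 2DROP RECTYPE-NULL EXIT THEN \ prefix does not match
    2 /STRING hex-digits>u IF rectype-0x ELSE RECTYPE-NULL THEN ;

\ A private sequence holding just this recognizer.
2 NEW-RECOGNIZER-SEQUENCE CONSTANT app-recs
' rec-0x 1 app-recs SET-RECOGNIZER
```

With these definitions, `S" 0x1F" app-recs RECOGNIZE` leaves the number and `rectype-0x` on the stack, and `RECTYPE>INT EXECUTE` then performs the (empty) interpret action. The recognizer never touches >IN and restores BASE, so it has no side effects on a RECTYPE-NULL result.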

A.XY Informal Appendix

A.XY.1 Text Interpreter

The Forth text interpreter can be changed into a generic tool that is capable of dealing with any data type. It maintains STATE and calls the data processing methods according to it. The example is a full replacement if all necessary recognizers are available.

The algorithm of the Forth text interpreter as described in section 3.4 is modified. All subsections of 3.4 apply unchanged. Steps b) and c) of section 3.4 become optional, since they can be performed with recognizers. Replace step d) with the following steps d) to f):

  d) For each element of the recognizer sequence provided by FORTH-RECOGNIZER, starting with the top element, call its parsing method with the sub-string "name" from step a).

    Every parsing method returns an information token and the parsed data from the analyzed sub-string if successful. Otherwise it returns the system provided failure token RECTYPE-NULL and no further data.

    Continue with the next element in the recognizer set until either all are used or the information token returned from the parsing word is not the system provided failure token RECTYPE-NULL.

  e) Use the information token and do one of the following:

    1. if interpreting, execute the interpret method associated with the information token.
    2. if compiling, execute the compile method associated with the information token.
  f) Continue with a)

: INTERPRET
  BEGIN
      PARSE-NAME DUP
  WHILE
      FORTH-RECOGNIZER RECOGNIZE
      STATE @ IF RECTYPE>COMP ELSE RECTYPE>INT THEN
      EXECUTE
      ?STACK  \ simple housekeeping
  REPEAT 2DROP
;

A.XY.2 POSTPONE

POSTPONE compiles the data returned by RECOGNIZE (i*x) into the dictionary as literal(s) and appends the compilation action of the RECTYPE-TOKEN data type id. Later at run-time the i*x data is read back and the compilation action is performed like it would have been called directly at compile time.

: POSTPONE ( "name" -- )
  PARSE-NAME FORTH-RECOGNIZER RECOGNIZE DUP >R
  RECTYPE>POST EXECUTE R> RECTYPE>COMP COMPILE, ; IMMEDIATE

This implementation assumes a system that uses recognizers only.
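
The two steps that POSTPONE performs can be traced by hand for a single-cell value. The following is a minimal sketch, assuming RECTYPE:, RECTYPE>POST and RECTYPE>COMP from the reference implementation; `rectype-cell`, `postpone-cell`, `five,`, `make-five` and `foo` are hypothetical names used only for this illustration.

```forth
\ Sketch only: assumes RECTYPE:, RECTYPE>POST and RECTYPE>COMP from the
\ reference implementation. All lower-case names are hypothetical.

\ A data type for one cell: interpret leaves it alone, compile and
\ postpone both lay it down as a literal.
:NONAME ;  :NONAME POSTPONE LITERAL ;  DUP  RECTYPE: rectype-cell

\ What POSTPONE does once RECOGNIZE has returned ( x rectype-cell ):
\ store x as a literal, then append the compile action.
: postpone-cell ( x -- )
    rectype-cell RECTYPE>POST EXECUTE
    rectype-cell RECTYPE>COMP COMPILE, ;

: five, ( -- ) 5 postpone-cell ; IMMEDIATE   \ stands in for "POSTPONE 5"
: make-five ( -- ) five, ; IMMEDIATE         \ body: literal 5, then the compile action
: foo ( -- n ) make-five ;                   \ make-five compiles 5 into foo
```

When `make-five` runs during the compilation of `foo`, the stored literal is read back and the compile action lays it down in `foo`, just as if `5` had been written there directly.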

A.XY.3 Test Cases

The test cases assume a stack to implement the recognizer set.

T{ 4 NEW-RECOGNIZER-SEQUENCE constant RS -> }T

T{ :NONAME 1 ;  :NONAME 2 ;  :NONAME 3  ; RECTYPE: rectype-1 -> }T
T{ :NONAME 10 ; :NONAME 20 ; :NONAME 30 ; RECTYPE: rectype-2 -> }T

T{ : rec-1 NIP 1 = IF rectype-1 ELSE RECTYPE-NULL THEN ; -> }T
T{ : rec-2 NIP 2 = IF rectype-2 ELSE RECTYPE-NULL THEN ; -> }T

T{ rectype-1 RECTYPE>INT  EXECUTE -> 1 }T
T{ rectype-1 RECTYPE>COMP EXECUTE -> 2 }T
T{ rectype-1 RECTYPE>POST EXECUTE -> 3 }T

\ testing RECOGNIZE
T{         0 RS SET-RECOGNIZER -> }T
T{ S" 1"     RS RECOGNIZE   -> RECTYPE-NULL }T
T{ ' rec-1 1 RS SET-RECOGNIZER -> }T
T{ S" 1"     RS RECOGNIZE   -> rectype-1 }T
T{ S" 10"    RS RECOGNIZE   -> RECTYPE-NULL }T
T{ ' rec-2 ' rec-1 2 RS SET-RECOGNIZER -> }T
T{ S" 10"    RS RECOGNIZE   -> rectype-2 }T

The dictionary lookup has the following test cases

T{ S" DUP" REC-FIND  -> ' DUP -1 RECTYPE-XT }T
T{ S" UNKNOWN WORD" REC-FIND -> RECTYPE-NULL }T

The number recognizer has the following checks

VARIABLE OLD-BASE BASE @ OLD-BASE !

T{ : S-1234 S" 1234" ; -> }T
T{ : D-1234 S" 1234." ; -> }T
T{ : S-UNKNOWN S" unknown word" ; -> }T
T{ : S-DUP  S" DUP" ; -> }T

T{ S-1234 FORTH-RECOGNIZER RECOGNIZE -> 1234  RECTYPE-NUM   }T
T{ D-1234 FORTH-RECOGNIZER RECOGNIZE -> 1234. RECTYPE-DNUM  }T
T{ S-DUP  FORTH-RECOGNIZER RECOGNIZE -> ' DUP -1 RECTYPE-XT }T
T{ S-UNKNOWN FORTH-RECOGNIZER RECOGNIZE  -> RECTYPE-NULL }T
T{ S" %-10010110" REC-NUM -> -150 RECTYPE-NUM }T
T{ S" %10010110"  REC-NUM ->  150 RECTYPE-NUM }T
T{ S" 'Z'"    REC-NUM -> char Z RECTYPE-NUM }T
T{ S" ABCXYZ" REC-NUM -> RECTYPE-NULL }T

\ check whether BASE is unchanged
T{ BASE @ OLD-BASE @ = -> -1 }T

Floating point numbers are handled likewise

T{ : S-1234e5 S" 1234e5" ; -> }T
T{ S-1234e5 REC-FLOAT -> 1234e5 RECTYPE-FLOAT }T
T{ S-1234e5 FORTH-RECOGNIZER RECOGNIZE -> 1234e5 RECTYPE-FLOAT }T

Experience

First ideas to dynamically extend the Forth text interpreter were published in 2005 at comp.lang.forth by Josh Fuller and J Thomas: Additional Recognizers?

A specific solution to deal with number prefixes was roughly sketched by Anton Ertl at comp.lang.forth in 2007 with https://groups.google.com/forum/#!msg/comp.lang.forth/r7Vp3w1xNus/Wre1BaKeCvcJ

There are a number of specific solutions that can at least partly be seen as recognizers in various Forths:

  • prefix-detection in ciforth
  • W32Forth uses its "chain" concept to achieve similar effects.
  • various commercial Forths seem to have ways to extend the interpreter.
  • FICL, a system close to Forth, has parse-steps since approx 2001.

A first generic recognizer concept was implemented in amforth version 4.3 (May 2011). The design presented in this RFD is implemented with version 5.3 (May 2014). gforth has recognizers since 2012, the ones described here since June 2014.

Existing recognizers cover a wide range of data formats like floating point numbers and strings. Others mimic the back-tick syntax used in many Unix shells to execute OS sub-processes. A recognizer is also used to implement OO notations.

Most of the small words that constitute a recognizer don't need a name actually since only their execution tokens are used. For the major words a naming convention is suggested: REC-<name> for the parsing word, and RECTYPE-<name> for the data type word created with RECTYPE: for the data type "name".

Acknowledgments

The following people made major or minor contributions, in no particular order.

  • Bernd Paysan
  • Jenny Brien
  • Andrew Haley
  • Alex McDonald
  • Anton Ertl
  • Forth 200x Committee

BerndPaysan

There are a number of discussions going on elsewhere, and I want to make sure the discussion is going to be here. So for a start, I just took Matthias' version D proposal.

Ideas discussed:

  • Bikeshedding the names again (we already did that).
  • Revert the postpone action to version A
  • Introduce RECOGNIZER-SEQUENCE: word, which creates a new recognizer that actually consists of a sequence of those.

Please feel free to introduce these ideas here.

AntonErtl

The "RECOGNIZER-SEQUENCE:" proposal is now online: Nestable Recognizer Sequences.

ruv

There are a number of discussions going on elsewhere, and I want to make sure the discussion is going to be here.

There are too many questions to discuss all of them on a single page.

I opened several issues on the subject of review API v4. Please check.

Some notable ones:

  • side effects are not acceptable (Issue #7)
  • arguments against VALUE in API (Issue #5)
  • accessors are not needed (Issue #6)
  • new arguments concerning "unrecognized vs zero" (Issue #4)
  • choosing better names (Issue #3)

I suggest the following road map:

  1. Terms definitions and data types
  2. A minimal essential part
  3. Choose among the different approaches (e.g., I suggest to discard both postponing and reproducing actions in the public API).
  4. Choose solutions for some problems
  5. Choose names

ruv

Take note of the Comparison of terminology to some past proposals.

For example, it shows that "to translate" is an important and very useful term.

UlrichHoffmann

I adapted the term changes suggested with respect to the Recognizer-A-Proposal to the latest D-Proposal:

| Term in proposal D        | Term suggested                     | comment                               |
| ------------------------- | ---------------------------------- | ------------------------------------- |
| recognizer stack          | recognizer-order                   | similar to search-order               |
| data type id              | recognizer information token (rit) | explicit and consistent               |
| RECOGNIZE                 | RECOGNIZE                          |                                       |
| RECTYPE:                  | RECOGNIZER                         | similar to WORDLIST, no defining word |
| RECTYPE-NULL              | UNRECOGNIZED                       | better english                        |
| REC-SOMETYPE              | recognize-xxx                      | better english                        |
| RECTYPE-SOMETYPE          | xxx-recognized                     | better english                        |

The items to be discussed mentioned in the preamble at the beginning of Recognizer RfD rephrase 2020 remain valid.

UlrichHoffmann

The output of the scanner is often called "token class":

From Wikipedia (Lexical Analysis)

Lexers and parsers are most often used for compilers, but can be used for other computer language tools, such as prettyprinters or linters. Lexing can be divided into two stages: the scanning, which segments the input string into syntactic units called lexemes and categorizes these into token classes; and the evaluating, which converts lexemes into processed values.

Also see "Modern Compiler Design", Second Edition, by Dick Grune, Kees van Reeuwijk, Henri E. Bal, Ceriel J.H. Jacobs and Koen Langendoen; they use the same term.

So it might be reasonable to call RECTYPE: just TOKEN-CLASS.

UlrichHoffmann

or maybe TOKENCLASS ?

ruv

TL;DR: token class from lexers is a wrong association.

The "token" term in lexical analysis

1. In computer science, the token term is used in various meanings.

2. In lexical analysis (and compilers theory), token is actually a shorthand for lexical token. (It's the same as in the Forth topic, the "definition" term is a shorthand for "Forth definition").

3. A lexical token is a tuple of a lexeme and the kind of this lexeme (this is my rewording of Wikipedia, which also says: "The lexeme's type combined with its value is what properly constitutes a token").

A lexical token usually doesn't bear any semantic information. It bears only the lexical kind: whether it is an identifier, a number, a string literal, or a particular key word.

Lexical tokens are not used in Forth since Forth doesn't distinguish lexemes by the different kinds.

4. In Forth, the token term is not used in the sense of lexical token. As we can conclude from the "execution token" and "name token" terms, in the Forth standard a token is just a kind of identifier, symbol, or something that represents something else (in the general case; but numbers can represent themselves).

Lexical token class

The output of the scanner is often called "token class"

It's not quite correct. Output of the scanner (i.e. a lexer) is a sequence of lexical tokens.

Concerning "token class" — it is a shorthand for "lexical token class".

And actually, token class, token type, token category, token name (in the famous Dragon book): all of them refer to the same thing, which I have called lexeme kind above.

Qualification of the tokens in Forth

We need to name the entity that qualifies a token.

We can call it "token class" (but without referring to "token class" in lexers). Initially I called this entity "token type". But then I realized that "token descriptor" is far better.

This entity can be created in run time, and it describes how to translate a token, — it is not an abstraction, it is an actual object (and the identifier of this object). So "create descriptor" shorthand sounds better than "create class" or "create type" (that look as more abstract things).

Only when we have found names for the terms (in human language, in the language of the standard) can we find good names for words.

AntonErtl

In Forth we have a number of tokens already: execution token, name token. Therefore, the terminology of a book about compilers would be misleading in Forth (cf. the meaning of "word" in a computer architecture book).

Moreover, different recognizers often produce the same rectype. E.g., in Gforth, rectype-num is returned by at least rec-num, rec-tick, and rec-body. So the role of a rectype is different from that of a token class in Grune's book.

One advantage that the rectype-* names have over tokenclass-* is that the association with recognizers is more obvious.

Finally, the committee already had a discussion on the names, and agreed on the rectype-* names. As far as I am concerned, that is good enough. And in particular, I think that the incremental benefits of better names (even if one could be agreed upon, which is doubtful) are not worth the costs of trying to find it, finding consensus on it, changing the existing code and documentation, and making existing discussions about this concept harder to understand.

ruv

One advantage that the rectype-* names have over tokenclass-* is that the association with recognizers is more obvious.

But it should have association with tokens, not with recognizers!

What is the etymology of "rectype"?

I see the following disadvantages of "rectype":

  1. It's not an English word; it's an abbreviation, and it isn't explained.
  2. "rec" is first associated with "record", which has nothing to do with recognizers.
  3. "rectype" makes the first association that it's a type of record, that is wrong. The second association is that it's a type of a recognizer — that is also wrong.
  4. "rectype" describes a token, not a recognizer, but it refers to a recognizer.

E.g., why is the entity that describes "execution token" called "rectype of execution token"? (My suggestion is: "descriptor of execution token").

not worth the costs of trying to find it, finding consensus on it, changing the existing code and documentation,

When we make a mistake, we pay for it. "rectype" is a mistake in the choice of name. It seems that most of us (who work on recognizer proposals) understand this, but did not want to find a better name at an earlier stage and postponed the choice to a later stage. And now you say it is not worth changing this name.

So we should have made this correction earlier. And now, I believe, we should pay the price, and fixing this mistake is worth the price.

Actually, the cost of changing the code is a weak argument. The internal code is not required to be updated (it's enough to make synonyms), and it also can be updated via auto-replacing.

The cost of changing the documentation is an even weaker argument. It can be updated via auto-replacing. But its terminology is wrong in any case, and this terminology should be fixed manually in far more places.

Concerning "trying to find it, finding consensus on it" — we didn't even try it: we have only two alternative suggestions (perhaps, only one already). And nobody said that "rectype" by itself is better than "token descriptor" for our purpose.

AntonErtl

A proof-of-concept dot-parser recognizer

There have been questions about writing a dot-parser as a recognizer, in particular wrt. POSTPONE and the way that works in the present proposal. Here I present a proof-of-concept of such a recognizer.

A proof-of-concept ClassVFX subset and its dot-parser

A dot-parser is used in the context of a way to control name spaces, as is typical for an object-oriented package. So first we implement a subset of ClassVFX (which has a dot-parser), without implementing the ClassVFX syntax exactly. Here's an example of its usage:

type: point:
  int: x
  int: y
end-type

type: line:
  point: start
  point: end
end-type

instance: point: p1
instance: point: p2
instance: line: l1

This is essentially a structure package. The fields x and y are in a wordlist private to point:; likewise, start and end are private to line:. So how do we access these fields? We access them by writing point:.x and line:.start. The dot-parser is responsible for recognizing these "words".

In order to avoid having to write line:.start @ point:.x, you can combine these into line:.start.x (for arbitrarily long sequences).

In this proof-of-concept, the start and end fields in line: contain the address of a point: instance, not the point: itself. For the recognizer, this is the harder problem, because the dot-expression corresponds to a sequence like offset1 + @ offset2 + @ offset3 + (with the length of the sequence depending on the number of dots), while with the other variant a word equivalent to offset + would be sufficient.

You can find the complete implementation with examples (tested on gforth 0.7.9_20200917) in dot-parser.fs. Below you find the implementation of the recognizer.

Dot-parser implementation

The questions were about writing the dot-parser recognizer, so here I'll explain the code in more detail.

A central point is how to represent the dot-parsed "word". I represent it as a sequence of xts plus an integer specifying the number of xts in the sequence. E.g., line:.start.x is represented by the xts of start @ x (where start is in the line:-private wordlist, and x in the point:-private wordlist) followed by 3 (the count).

The recognizer itself has the stack effect

rec-dot-parser ( c-addr u -- xt1 ... xtn n rectype-dp | rectype-null )

For the rectype-dp we need three actions. The first is the action for interpreting the "word"; it just executes the xts, starting with xt1:

: dp-int ( ... xt1 .. xtn n -- ... )
    \ remove xt1 .. xtn n from the data stack, then execute xt1 .. xtn.
    dup if
    swap >r 1- recurse r> execute exit then
    drop ;

The action for compiling the "word" compiles these xts into the current definition:

: dp-comp ( xt1 .. xtn n -- )
    \ compile, xt1 .. xtn, in this order
    dup if
    swap >r 1- recurse r> compile, exit then
    drop ;

POSTPONE compiles dp-comp into the current definition (rather than executing it). But because this compiled dp-comp needs the xts at a different time than when they are available from the recognizer, we need a literal-like action to get the xts from the recognizer time to the time when this dp-comp finally runs. This literal-like action is called the "postpone action" in the proposal. Anyway, its implementation is similar to that of dp-int and dp-comp:

: dp-lit1 ( x1 .. xn n -- )
    \ compile x1 .. xn as literals, in this order
    dup if
    swap >r 1- recurse r> postpone literal exit then
    drop ;

t{ :noname [ 2 3 4 3 dp-lit1 ] ; execute -> 2 3 4 }t

: dp-lit ( xt1 .. xtn n -- )
    \ compile xt1 .. xtn n as literals
    dup >r dp-lit1 r> postpone literal ;

Once all these actions exist, we can define rectype-dp:

' dp-int ' dp-comp ' dp-lit rectype: rectype-dp

The recognizer itself is a more complex piece of code, but not particularly important for understanding how postponing a dot-parsed piece of code works, so I'll not explain it in detail:

: split ( c-addr1 u1 c-addr2 u2 -- c-addr3 u3 c-addr4 u4 true | c-addr1 u1 false )
    \ If c-addr2 u2 is found in c-addr1 u1, return true, and c-addr3
    \ u3 and c-addr4 u4 are the parts to the left and the right of the
    \ found string.  If not return c-addr1 u1 and false
    2over 2>r dup >r search if
    over swap r> /string 2r> 2swap 2>r drop tuck - 2r> true
    else
    r> drop 2r> 2drop false
    then ;

: rec-dot-parser ( c-addr u -- xt1 ... xtn n rectype-dp | rectype-null )
    \ this leaves out the handling of a number of cases resulting in
    \ rectype-null in the interest of showing the successful case more clearly
    s" ." split 0= if
    2drop rectype-null exit then
    2swap find-name \ !! deal with not-found and not-<typename>
    name>interpret >body @ >r 1 -rot begin ( xt1 .. xtn n c-addr1 u1 r:wid )
    s" ." split while
        2swap r> find-name-in \ !! deal with not-found and not-<fieldname>
        name>interpret dup >body cell+ @ @ >r
        -rot 2>r ['] @ rot 2 + 2r>
    repeat
    r> find-name-in \ !! deal with not-found and not-<fieldname>
    name>interpret swap rectype-dp ;

Note that this recognizer does not properly handle the cases where the string contains a dot, but should not be recognized by the dot-parser (it's a proof-of-concept).

In gforth 0.7.9_20200917, this recognizer is searched last with

' rec-dot-parser get-recognizers 1+ set-recognizers

Usage examples

Using the definitions of point:, line:, their fields, and instances l1, p1, p2, you can do:

\ interpretive uses:
p1 l1 line:.start !
8  l1 line:.start.y !

\ compiled use:
: foo line:.start.y @ ;

\ postpone use:
: bar postpone line:.start.x ; immediate
: flip bar ;

Gforth's see decompiles foo, bar, and flip as:

: foo  
  start @ y @ ;
: bar  ['] start ['] @ ['] x 3 
  dp-comp ; immediate
: flip  
  start @ x ;

ruv

Just for comparison, in the minimalistic API, instead of dp-int, dp-comp, dp-lit1, dp-lit and rectype-dp words (five words in total) we should define only one general purpose word tt-nxt:

: tt-nxt ( ... xt1 .. xtn n -- ... )
    \ remove xt1 .. xtn n from the data stack, then translate xt1 .. xtn one by one in this order.
    dup if
    swap >r 1- recurse r> tt-xt exit then
    drop ;

The only change in rec-dot-parser is that rectype-dp is replaced by ['] tt-nxt.