,---------------.
| Contributions |
`---------------´

,------------------------------------------
| 2020-02-24 09:57:56 UlrichHoffmann wrote:
| proposal - Recognizer RfD rephrase 2020
| see: https://forth-standard.org/proposals/recognizer-rfd-rephrase-2020#contribution-131
`------------------------------------------

## Recognizer RfD rephrase 2020

Author: Ulrich Hoffmann
Contact: uho@xlerb.de
Version: 0.8
Date: 2020-02-24
Status: Published

#### Preamble

This text is a rephrasing of just section XY.2, XY.6, section XY.7 and parts of A.XY of the original recognizer RfD [1] by Matthias Trute. It uses terminology and word names closer to those already present in Forth-94 and Forth-2012.

It is not intended to invalidate the subsequent RfDs B, C or D [2][3][4]. They reflect the ongoing discussion about Forth recognizers and should be considered valuable documentation of that discussion.

This text, however, is intended to return the recognizer proposal to simplicity of concepts and terms, making it easier to understand and use as well as simpler to implement.

This text does *not* add any new functionality to the original proposal. It merely introduces different terms for the structures already existing in the original proposal. The only difference in functionality is the substitution of the defining word RECOGNIZER: of the original proposal by the word RECOGNIZER (note the missing : ) that - similar to the Forth-94 word WORDLIST - creates a recognizer information token and leaves it on the data stack.

Yes - this text has the potential to start a bikeshedding discussion, but as the recognizer concepts seem to have been stable over the last couple of years, it is about time to agree on appropriate names and notions.
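The one functional change described above, replacing the defining word RECOGNIZER: by RECOGNIZER, can be sketched as follows. This is only an illustration, not normative text: the definition of RECOGNIZER: shown here is an assumption about how the defining word of the original proposal could be implemented, and the names demo-action, old-style-recognized and new-style-recognized are hypothetical.

```forth
\ Hypothetical action used for all three slots, for illustration only.
: demo-action ( -- ) ;

\ Assumed shape of the defining word of the original proposal:
\ it parses a name and creates a named token in one step.
: RECOGNIZER: ( xt-interpret xt-compile xt-postpone "name" -- )
  CREATE SWAP ROT , , , ;

\ The word used in this text: like WORDLIST it leaves the token
\ on the data stack and leaves the naming to the caller.
: RECOGNIZER ( xt-interpret xt-compile xt-postpone -- rit )
  HERE >R SWAP ROT , , , R> ;

\ Old style: the name is parsed by the defining word.
' demo-action ' demo-action ' demo-action RECOGNIZER: old-style-recognized

\ New style: the token is an ordinary stack value, named here via CONSTANT.
' demo-action ' demo-action ' demo-action RECOGNIZER CONSTANT new-style-recognized
```

In this sketch both words store the same three cells; the difference is only whether the name is bound by the creating word or by the caller.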
The following table summarizes the different terms and names:

| Term in original proposal | Term used here                     | Comment                               |
| ------------------------- | ---------------------------------- | ------------------------------------- |
| recognizer stack          | recognizer-order                   | similar to search-order               |
| information token (rit)   | recognizer information token (rit) | explicit and consistent               |
| DO-RECOGNIZER             | RECOGNIZE                          | avoid hyphen in name                  |
| RECOGNIZER:               | RECOGNIZER                         | similar to WORDLIST, no defining word |
| R:FAIL                    | UNRECOGNIZED                       | no : in name, better English          |
| REC:xxx                   | recognize-xxx                      | no : in name, better English          |
| R:xxx                     | xxx-recognized                     | no : in name, better English          |

#### Items to discuss

1. Programs that use the word RECOGNIZE (e.g. user-defined text interpreters) most likely need to use the interpret/compile/postpone xts of the returned recognizer information token. For these programs to be portable among standard systems, appropriate access words would need to be standardized. [3] and [4] propose such words. Without these access words, standardizing the word RECOGNIZE is doubtful. Standardizing only the modified (internal) text interpreter behavior would then be sufficient.

2. The word RECOGNIZER (and the corresponding defining word RECOGNIZER: of [1]) creates the opaque structure *recognizer information token*. As an alternative, *recognizer information tokens* could be defined - similar to addresses of counted strings (c-addr) - as special addresses, and the structure of memory at that address could be exposed. A *recognizer information token* could then be created by already existing standard words such as CREATE ALLOT ALLOCATE and would have a known layout, e.g. three xts in sequence: { INTERPRET-XT | COMPILE-XT | POSTPONE-XT }. The access words of item 1 would not need to be standardized, as each standard program could access the xts using already existing standard words for memory access.

3. The word RECOGNIZER (and the corresponding defining word RECOGNIZER: of [1]), despite its name, does not create a recognizer (i.e. a parsing word plus possibly several recognizer information tokens) but a single recognizer information token (a triple of interpret/compile/postpone xts characterized by a single-cell value). Another name might reflect this functionality better.

4. Changes in the standard text interpreter (i.e. that it invokes the word RECOGNIZE internally) have implications for many other words apart from MARKER (e.g. ' ['] EVALUATE INCLUDE-FILE INCLUDED ...). Changes in their behaviour should be mentioned in the proposal. [2] proposes explicit changes for ' ['] MARKER, while [3] and [4] have a paragraph describing the implications generally and do not propose explicit changes, e.g. to MARKER.

5. *Recognizer information tokens* (triples of interpret/compile/postpone xts characterized by a single-cell value) could be named more appropriately. [4] proposes a different name, *data type id*, that does not seem to be appropriate. Its general notion seems to mislead into the direction of Forth having a data type system.

   From a classical computer science view, recognizers act in the lexical analysis (*scanner*) phase of a compiler. They operate on sequences of characters, detect appropriate *lexemes* (character subsequences of the input stream) and convert them to *tokens*. Several lexemes might map to the same token (e.g. different sequences of digits map to the token NUM) along with so-called *attributes* (e.g. the value of the number). For this reason tokens are sometimes also called *token classes* or *token types* or the *kind* of the token. These might be good alternative names instead of *recognizer information token* or *data type id*.

   Forth-94 and Forth-2012 use the term ID (as in wordlist-id or file-id) to define characterizing single-cell values, so going along with xxx-id would be consistent with existing standard terms (maybe *recognizer-token-id*?).
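The alternative raised in item 2 can be sketched as follows, assuming the exposed layout { INTERPRET-XT | COMPILE-XT | POSTPONE-XT } mentioned there. All demo-* names are hypothetical and the three actions are placeholders that only leave a distinct number on the stack; this is not part of the proposal.

```forth
\ Sketch under the assumption of item 2: a rit is the address of
\ three consecutive cells { INTERPRET-XT | COMPILE-XT | POSTPONE-XT }.
\ All demo-* names are hypothetical and exist only for this example.

: demo-interpret ( -- 1 ) 1 ;   \ stand-in for an interpret action
: demo-compile   ( -- 2 ) 2 ;   \ stand-in for a compile action
: demo-postpone  ( -- 3 ) 3 ;   \ stand-in for a postpone action

\ Build the token with existing standard words only (no RECOGNIZER).
CREATE demo-rit
  ' demo-interpret ,  ' demo-compile ,  ' demo-postpone ,

\ Fetch each xt with plain memory access words and execute it.
: demo-run-interpret ( i*x rit -- j*x )  @ EXECUTE ;
: demo-run-compile   ( i*x rit -- j*x )  CELL+ @ EXECUTE ;
: demo-run-postpone  ( i*x rit -- j*x )  2 CELLS + @ EXECUTE ;
```

With such a layout a user-defined text interpreter would need only @ and CELL+ to select an action, at the cost of fixing the memory representation of the token for all systems.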
#### References

[1] *Forth Recognizer -- Request For Discussion, Version 1*, Matthias Trute, 2014-10-03, access at

[2] *Forth Recognizer -- Request For Discussion, Version 2*, Matthias Trute, 2015-09-20, access at

[3] *Forth Recognizer -- Request For Discussion, Version 3*, Matthias Trute, 2016-09-04, access at

[4] *Forth Recognizer -- Request For Discussion, Version 4*, Matthias Trute, 2018-08-02, access at

---

## Proposal

....

### XY.2 Additional terms and notations

**Recognizer Information Token**: An implementation-dependent single-cell value that identifies the data type and a method table to perform the data processing of the interpreter. A naming convention suggests that the names end with *-recognized*. Recognizer Information Tokens are abbreviated *rit* in stack comments.

**Recognizer**: A text parsing word combined with the recognizer information tokens it can return. If successful, the parsing word returns a recognizer information token together with the parsed data. The text parsing word is assumed to run in cooperation with SOURCE and >IN. A naming convention suggests that the names start with *recognize-*.

...

### XY.6 Glossary

### XY.6.1 Recognizer words

**RECOGNIZE** ( addr len -- i*x rit | UNRECOGNIZED ) RECOGNIZER

Apply the recognizers in the recognizer-order to the string at "addr/len" one after the other. Terminate the iteration if either a recognizer returns a recognizer information token *rit* that is different from UNRECOGNIZED or the recognizer-order is exhausted. If the recognizer-order is exhausted, return UNRECOGNIZED; otherwise return *rit*.

"i*x" is the result of the parsing word. It may be in locations other than the data stack. In this case the stack diagram should be read accordingly.

It is an ambiguous condition if the recognizer-order is empty.

----

**GET-RECOGNIZERS** ( -- rec-n .. rec-1 n ) RECOGNIZER

Return the execution tokens rec-1 .. rec-n of the parsing words in the recognizer-order. rec-1 identifies the recognizer that is called first and rec-n the one that is called last.
The recognizer-order is unaffected.

----

**MARKER** ( "name" -- ) RECOGNIZER

Extend MARKER to include the current recognizer-order in the state preservation.

----

**UNRECOGNIZED** ( -- UNRECOGNIZED ) RECOGNIZER

A constant cell-sized recognizer information token with two uses. First, it is used to deliver the information that a specific recognizer could not deal with the string passed to it. Second, it is a predefined recognizer information token whose elements are used when no recognizer from the recognizer-order could handle the passed string. These methods provide the system error actions.

The actual numeric value is system dependent and has no predictable value.

----

**RECOGNIZER** ( XT-INTERPRET XT-COMPILE XT-POSTPONE -- rit ) RECOGNIZER

Create a recognizer information token *rit* with the three execution tokens XT-INTERPRET XT-COMPILE XT-POSTPONE. The implementation is system dependent.

The words for XT-INTERPRET, XT-COMPILE and XT-POSTPONE are called with the parsed data that the associated parsing word of the recognizer returned. The information token itself is consumed by the interpreter.

----

**SET-RECOGNIZERS** ( rec-n .. rec-1 n -- ) RECOGNIZER

Set the recognizer-order to the recognizers identified by the execution tokens of their parsing words rec-n .. rec-1. rec-1 will be the parsing word of the recognizer that is called first, rec-n will be the last one.

It is an ambiguous condition if n is not a positive number.

### XY.7 Reference Implementation

    \ create a simple 3 element structure
    \ rit             : XT-INTERPRET
    \ rit CELL+       : XT-COMPILE
    \ rit 2 CELLS +   : XT-POSTPONE
    : RECOGNIZER ( XT-INTERPRET XT-COMPILE XT-POSTPONE -- rit )
      HERE >R SWAP ROT , , , R> ;

    \ system failure recognizer
    : notfound ( i*x -- ) -13 THROW ;
    ' notfound ' notfound ' notfound RECOGNIZER CONSTANT UNRECOGNIZED

    \ contains the recognizer-order
    \ first cell is the current number of recognizers.
    10 CELLS BUFFER: recognizer-order
    0 recognizer-order !

    : SET-RECOGNIZERS ( rec-n .. rec-1 n -- )
      DUP recognizer-order !
      BEGIN
        DUP
      WHILE
        DUP CELLS recognizer-order + ROT SWAP ! 1-
      REPEAT DROP ;

    : GET-RECOGNIZERS ( -- rec-n .. rec-1 n )
      recognizer-order @ recognizer-order
      BEGIN
        CELL+ OVER
      WHILE
        DUP @ ROT 1- ROT
      REPEAT 2DROP
      recognizer-order @ ;

    : RECOGNIZE ( addr len -- i*x rit | UNRECOGNIZED )
      recognizer-order @
      BEGIN
        DUP
      WHILE
        DUP CELLS recognizer-order + @
        2OVER 2>R SWAP 1- >R
        EXECUTE DUP UNRECOGNIZED <> IF
          R> DROP 2R> 2DROP EXIT
        THEN
        DROP R> 2R> ROT
      REPEAT
      DROP 2DROP UNRECOGNIZED ;

#### POSTPONE

POSTPONE is outside the Forth interpreter:

    : POSTPONE ( "name" -- )
      BL WORD COUNT
      RECOGNIZE
      2 CELLS + @ ( post )  \ get the XT-POSTPONE from recognizer
      EXECUTE ; IMMEDIATE

...

### A.XY Informal Annex

#### A.XY.1 Forth Text Interpreter

The Forth text interpreter turns into a generic tool that is capable of dealing with any data type. It maintains STATE and calls the data processing methods according to it.

##### INTERPRETER

    : PARSE-NAME ( -- addr u ) BL WORD COUNT ;

    : INTERPRET ( i*x -- j*x )
      BEGIN
        PARSE-NAME DUP 0= IF 2DROP EXIT THEN  \ no more words?
        RECOGNIZE
        STATE @ IF CELL+ @ ( comp ) ELSE @ ( interp ) THEN  \ get the right XT
        EXECUTE  \ do the action
        ?STACK   \ simple housekeeping
      AGAIN ;

#### A.XY.2 Example Recognizers

##### Word recognizer

    \ find-name is close to FIND. amforth specific.
    256 BUFFER: find-name-buf
    : place ( c-addr1 u c-addr2 ) 2DUP C! CHAR+ SWAP MOVE ;
    : find-name ( addr len -- xt +/-1 | 0 )
      find-name-buf place find-name-buf FIND
      DUP 0= IF NIP THEN ;

    \ FIND returns 1 for immediate words and -1 otherwise
    : immediate? ( flags -- true|false ) 0> ;

    \ Define word recognizer
    \ INTERPRET
    :NONAME ( i*x XT flags -- j*y )
      DROP EXECUTE ;
    \ COMPILE: execute immediate words, compile all others
    :NONAME ( XT flags -- )
      immediate? IF EXECUTE ELSE COMPILE, THEN ;
    \ POSTPONE
    :NONAME ( XT flags -- )
      immediate? IF COMPILE, ELSE POSTPONE LITERAL POSTPONE COMPILE, THEN ;
    RECOGNIZER CONSTANT word-recognized

    \ parsing word for word recognizer
    : recognize-word ( addr len -- XT flags rit | UNRECOGNIZED )
      find-name ( addr len -- XT flags | 0 )
      ?DUP IF word-recognized ELSE UNRECOGNIZED THEN ;

    \ prepend the word recognizer to the recognizer-order
    GET-RECOGNIZERS ' recognize-word SWAP 1+ SET-RECOGNIZERS

----

**end of document**