Proposal: Recognizer RfD rephrase 2020
This proposal has been moved into this section. Its former address was: /standard/intro
UlrichHoffmann, [131] Recognizer RfD rephrase 2020, Proposal, 2020-02-24 09:57:56
Recognizer RfD rephrase 2020
Author: Ulrich Hoffmann
Contact: uho@xlerb.de
Version: 0.8
Date: 2020-02-24
Status: Published
Preamble
This text rephrases just sections XY.2, XY.6 and XY.7 and parts of A.XY of the original recognizer RfD [1] by Matthias Trute, using terminology and word names closer to those already present in Forth-94 and Forth-2012.
It is not intended to invalidate the subsequent RfDs B, C or D [2][3][4]. They reflect the ongoing discussion about Forth recognizers and should be considered valuable documentation of that discussion. This text, however, is intended to bring the recognizer proposal back to simplicity of concepts and terms, making it easier to understand and use as well as simpler to implement.
This text does not add any new functionality to the original proposal. It merely introduces different terms for the structures already existing in the original proposal. The only difference in functionality is the substitution of the defining word RECOGNIZER: of the original proposal by the word RECOGNIZER (note the missing : ) that - similar to the Forth-94 word WORDLIST - creates a recognizer information token and leaves it on the data stack.
Yes, this text has the potential to start a bikeshedding discussion, but as the recognizer concepts seem to have been stable over the last couple of years, it is about time to agree on appropriate names and notions.
The following table summarizes the different terms and names:
Term in original proposal | Term used here | Comment |
---|---|---|
recognizer stack | recognizer-order | similar to search-order |
information token (rit) | recognizer information token (rit) | explicit and consistent |
DO-RECOGNIZER | RECOGNIZE | avoid hyphen in name |
RECOGNIZER: | RECOGNIZER | similar to WORDLIST, no defining word |
R:FAIL | UNRECOGNIZED | no : in name, better English |
REC:xxx | recognize-xxx | no : in name, better English |
R:xxx | xxx-recognized | no : in name, better English |
Items to discuss
1. Programs that use the word RECOGNIZE (e.g. user-defined text interpreters) most likely need to use the interpret/compile/postpone xts of the returned recognizer information token. For these programs to be portable among standard systems, appropriate access words would need to be standardized. [3] and [4] propose such words. Without these access words, standardizing the word RECOGNIZE is doubtful; standardizing only the modified (internal) text interpreter behavior would then be sufficient.
2. The word RECOGNIZER (and the corresponding defining word RECOGNIZER: of [1]) creates the opaque structure recognizer information token. As an alternative, recognizer information tokens could be defined, similar to addresses of counted strings (c-addr), as special addresses, and the structure of memory at that address could be exposed. Recognizer information tokens could then be created by already existing standard words such as CREATE ALLOT ALLOCATE and would have a known layout, e.g. three xts in sequence: { INTERPRET-XT | COMPILE-XT | POSTPONE-XT }. The access words of item 1 would not need to be standardized, as each standard program could access the xts using already existing standard words for memory access.
3. The word RECOGNIZER (and the corresponding defining word RECOGNIZER: of [1]), despite its name, does not create a recognizer (i.e. a parsing word plus possibly several recognizer information tokens) but a single recognizer information token (a triple of interpret/compile/postpone xts characterized by a single-cell value). Another name might reflect this functionality better.
4. Changes in the standard text interpreter (i.e. that it invokes the word RECOGNIZE internally) have implications for many other words apart from MARKER (e.g. ' ['] EVALUATE INCLUDE-FILE INCLUDED ...). Changes in their behaviour should be mentioned in the proposal. [2] proposes explicit changes for ' ['] MARKER, while [3] and [4] have a paragraph describing the implications generally and do not explicitly propose changes to individual words such as MARKER.
5. Recognizer information tokens (a triple of interpret/compile/postpone xts characterized by a single-cell value) could be named more appropriately. [4] proposes a different name, data type id, that does not seem to be appropriate: its general notion seems to mislead in the direction of Forth having a data type system.
From a classical computer science view, recognizers act in the lexical analysis (scanner) phase of a compiler, operating on sequences of characters, detecting appropriate lexemes (character subsequences of the input stream) and converting them to tokens. Several lexemes might map to the same token (e.g. different sequences of digits map to the token NUM) along with so-called attributes (e.g. the value of the number). For this reason tokens are sometimes also called token classes, token types or the kind of the token. These might be good alternative names instead of recognizer information token or data type id. Forth-94 and Forth-2012 use the term ID (as in wordlist-id or file-id) to define characterizing single-cell values, so following the xxx-id pattern would be consistent with existing standard terms (maybe recognizer-token-id?).
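The exposed three-cell layout { INTERPRET-XT | COMPILE-XT | POSTPONE-XT } discussed above can be illustrated with nothing but existing standard words. The following is a minimal sketch, not part of the proposal: the names demo-interpret, demo-compile, demo-postpone, demo-recognized and the rit>... accessors are hypothetical and chosen only for this illustration.

```forth
\ Sketch only: a rit as the address of three consecutive cells,
\ { INTERPRET-XT | COMPILE-XT | POSTPONE-XT }.
\ All names below (demo-*, rit>*) are hypothetical.

: demo-interpret ( n -- n )  ;                  \ leave the value on the stack
: demo-compile   ( n -- )    POSTPONE LITERAL ; \ compile the value as a literal
: demo-postpone  ( n -- )    -48 THROW ;        \ -48: invalid POSTPONE

\ Build the token with CREATE and , instead of an opaque RECOGNIZER.
CREATE demo-recognized
  ' demo-interpret ,  ' demo-compile ,  ' demo-postpone ,

\ Access needs nothing beyond @ CELL+ CELLS + .
: rit>interpret ( rit -- xt )  @ ;
: rit>compile   ( rit -- xt )  CELL+ @ ;
: rit>postpone  ( rit -- xt )  2 CELLS + @ ;
```

Under these assumptions, `42 demo-recognized rit>interpret EXECUTE` leaves 42 on the data stack. The accessors are mere conveniences here; a program could just as well write `@`, `CELL+ @` and `2 CELLS + @` inline.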
References
[1] Forth Recognizer -- Request For Discussion, Version 1, Matthias Trute, 2014-10-03, access at http://amforth.sourceforge.net/pr/Recognizer-rfc.pdf
[2] Forth Recognizer -- Request For Discussion, Version 2, Matthias Trute, 2015-09-20, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-B.pdf
[3] Forth Recognizer -- Request For Discussion, Version 3, Matthias Trute, 2016-09-04, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-C.pdf
[4] Forth Recognizer -- Request For Discussion, Version 4, Matthias Trute, 2018-08-02, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-D.pdf
Proposal
....
XY.2 Additional terms and notations
Recognizer Information Token: An implementation-dependent single-cell value that identifies the data type and a method table to perform the data processing of the interpreter. A naming convention suggests that the names end with -recognized. Recognizer Information Tokens are abbreviated rit in stack comments.
Recognizer: A text parsing word together with the recognizer information tokens it can return. If successful, the parsing word returns a recognizer information token together with the parsed data. The text parsing word is assumed to run in cooperation with SOURCE and >IN. A naming convention suggests that the names start with recognize-.
...
XY.6 Glossary
XY.6.1 Recognizer words
RECOGNIZE ( addr len -- i*x rit | UNRECOGNIZED ) RECOGNIZER
Apply the recognizers in the recognizer-order to the string at "addr/len" one after the other. Terminate the iteration if either a recognizer returns a recognizer information token rit that is different from UNRECOGNIZED or the recognizer-order is exhausted. In the latter case return UNRECOGNIZED, otherwise return rit.
"i*x" is the result of the parsing word. It may reside in locations other than the data stack. In this case the stack diagram should be read accordingly.
It is an ambiguous condition if the recognizer-order is empty.
GET-RECOGNIZERS ( -- rec-n .. rec-1 n ) RECOGNIZER
Return the execution tokens rec-1 .. rec-n of the parsing words in the recognizer-order. rec-1 identifies the recognizer that is called first and rec-n the recognizer that is called last.
The recognizer-order is unaffected.
MARKER ( "<spaces>name" -- ) RECOGNIZER
Extend MARKER to include the current recognizer-order in the state preservation.
UNRECOGNIZED ( -- UNRECOGNIZED ) RECOGNIZER
A constant, cell-sized recognizer information token with two uses. First, it is used to signal that a specific recognizer could not deal with the string passed to it. Second, it is a predefined recognizer information token whose elements are used when no recognizer from the recognizer-order could handle the passed string. These methods provide the system error actions.
The actual numeric value is system dependent and has no predictable value.
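The two uses of UNRECOGNIZED can be sketched as follows. This is only an illustration under the assumptions of the reference implementation in XY.7, where a rit is the address of three cells and the error action is -13 THROW; try-interpret and recognized? are hypothetical helper names.

```forth
\ Sketch only. RECOGNIZER, notfound and UNRECOGNIZED are repeated from the
\ reference implementation (XY.7) so that the fragment loads on its own;
\ try-interpret and recognized? are hypothetical helpers.

: RECOGNIZER ( xt-interpret xt-compile xt-postpone -- rit )
  HERE >R SWAP ROT , , , R> ;

: notfound ( i*x -- )  -13 THROW ;   \ -13: undefined word
' notfound ' notfound ' notfound RECOGNIZER CONSTANT UNRECOGNIZED

\ Use 1: as a return value, a plain comparison detects failure.
: recognized? ( rit -- flag )  UNRECOGNIZED <> ;

\ Use 2: as a rit, its actions are the system error actions.
\ Run the interpret action under CATCH and return the throw code.
\ Relies on the reference layout, where the first cell holds XT-INTERPRET.
: try-interpret ( i*x rit -- j*x ior )  @ CATCH ;
```

With these definitions, `UNRECOGNIZED recognized?` leaves a false flag and `UNRECOGNIZED try-interpret` leaves the throw code -13.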
RECOGNIZER ( XT-INTERPRET XT-COMPILE XT-POSTPONE -- rit ) RECOGNIZER
Create a recognizer information token rit with the three execution tokens XT-INTERPRET XT-COMPILE XT-POSTPONE. The implementation is system dependent.
The words for XT-INTERPRET, XT-COMPILE and XT-POSTPONE are called with the parsed data that the associated parsing word of the recognizer returned. The information token itself is consumed by the interpreter.
SET-RECOGNIZERS ( rec-n .. rec-1 n -- ) RECOGNIZER
Set the recognizer-order to the recognizers identified by the execution tokens of their parsing words rec-n .. rec-1. rec-1 will be the parsing word of the recognizer that is called first, rec-n will be the last one.
It is an ambiguous condition if n is not a positive number.
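Because GET-RECOGNIZERS and SET-RECOGNIZERS mirror GET-ORDER and SET-ORDER, the recognizer-order can be edited with ordinary stack manipulation. The sketch below shows one possible way to add a recognizer at either end; append-recognizer and prepend-recognizer are hypothetical names, not part of the proposal, and the first three definitions merely repeat the reference implementation of XY.7 so that the fragment is self-contained.

```forth
\ Sketch only. The first three definitions are repeated from the reference
\ implementation (XY.7); append-recognizer and prepend-recognizer are
\ hypothetical helper names.

10 CELLS BUFFER: recognizer-order   \ cell 0 holds the count, so 9 entries fit
0 recognizer-order !

: SET-RECOGNIZERS ( rec-n .. rec-1 n -- )
  DUP recognizer-order !
  BEGIN DUP WHILE
    DUP CELLS recognizer-order + ROT SWAP ! 1-
  REPEAT DROP ;

: GET-RECOGNIZERS ( -- rec-n .. rec-1 n )
  recognizer-order @ recognizer-order
  BEGIN CELL+ OVER WHILE
    DUP @ ROT 1- ROT
  REPEAT 2DROP
  recognizer-order @ ;

\ Make xt the recognizer that is called last (new rec-n):
\ xt ends up below the existing entries on the stack.
: append-recognizer ( xt -- )
  GET-RECOGNIZERS 1+ SET-RECOGNIZERS ;

\ Make xt the recognizer that is called first (new rec-1):
\ xt is tucked directly below the count.
: prepend-recognizer ( xt -- )
  >R GET-RECOGNIZERS R> SWAP 1+ SET-RECOGNIZERS ;
```

Both helpers always pass a positive n to SET-RECOGNIZERS, so they stay clear of the ambiguous condition. They do not check for overflow of the fixed-size buffer used by the reference implementation.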
XY.7 Reference Implementation
\ create a simple 3 element structure
\ rit           : XT-INTERPRET
\ rit CELL+     : XT-COMPILE
\ rit 2 CELLS + : XT-POSTPONE
: RECOGNIZER ( XT-INTERPRET XT-COMPILE XT-POSTPONE -- rit )
  HERE >R SWAP ROT , , , R> ;

\ system failure recognizer
: notfound ( i*x -- ) -13 THROW ;
' notfound ' notfound ' notfound RECOGNIZER CONSTANT UNRECOGNIZED

\ contains the recognizer-order
\ first cell is the current number of recognizers.
10 CELLS BUFFER: recognizer-order
0 recognizer-order !

: SET-RECOGNIZERS ( rec-n .. rec-1 n -- )
  DUP recognizer-order !
  BEGIN
    DUP
  WHILE
    DUP CELLS recognizer-order +
    ROT SWAP ! 1-
  REPEAT DROP
;

: GET-RECOGNIZERS ( -- rec-n .. rec-1 n )
  recognizer-order @ recognizer-order
  BEGIN
    CELL+ OVER
  WHILE
    DUP @ ROT 1- ROT
  REPEAT 2DROP
  recognizer-order @
;

: RECOGNIZE ( addr len -- i*x rit | UNRECOGNIZED )
  recognizer-order @
  BEGIN
    DUP
  WHILE
    DUP CELLS recognizer-order + @
    2OVER 2>R SWAP 1- >R
    EXECUTE DUP UNRECOGNIZED <> IF R> DROP 2R> 2DROP EXIT THEN DROP
    R> 2R> ROT
  REPEAT
  DROP 2DROP
  UNRECOGNIZED
;
POSTPONE
POSTPONE is outside the Forth interpreter:
: POSTPONE ( "<spaces>name" -- )
  BL WORD COUNT
  RECOGNIZE
  2 CELLS + @ ( post ) \ get the XT-POSTPONE from recognizer
  EXECUTE
; IMMEDIATE
...
A.XY Informal Annex
A.XY.1 Forth Text Interpreter
The Forth text interpreter turns into a generic tool that is capable of dealing with any data type. It maintains STATE and calls the data processing methods according to it.
INTERPRETER
: PARSE-NAME ( -- addr u ) BL WORD COUNT ;
: INTERPRET ( i*x -- j*x )
  BEGIN
    PARSE-NAME ?DUP 0= IF DROP EXIT THEN \ no more words?
    RECOGNIZE
    STATE @ IF CELL+ @ ( comp ) ELSE @ ( interp ) THEN \ get the right XT
    EXECUTE \ do the action
    ?STACK \ simple housekeeping (system-specific word)
  AGAIN
;
A.XY.2 Example Recognizers
Word recognizer
\ find-name is close to FIND. amforth specific.
256 BUFFER: find-name-buf
: place ( c-addr1 u c-addr2 -- )
  2DUP C! CHAR+ SWAP MOVE ;
: find-name ( addr len -- xt +/-1 | 0 )
  find-name-buf place
  find-name-buf
  FIND DUP 0= IF NIP THEN ;
: immediate? ( flags -- true|false ) 0> ;
\ Define word recognizer
\ INTERPRET
:NONAME ( i*x XT flags -- j*y )
  DROP EXECUTE ;
\ COMPILE: execute immediate words, compile all others
:NONAME ( XT flags -- )
  immediate?
  IF EXECUTE ELSE COMPILE, THEN ;
\ POSTPONE
:NONAME ( XT flags -- )
  immediate?
  IF COMPILE, ELSE POSTPONE LITERAL POSTPONE COMPILE, THEN ;
RECOGNIZER CONSTANT word-recognized
\ parsing word for word recognizer
: recognize-word ( addr len -- XT flags rit | UNRECOGNIZED )
  find-name ( addr len -- XT flags | 0 )
  ?DUP IF word-recognized ELSE UNRECOGNIZED THEN ;
\ prepend the word recognizer to the recognizer-order
GET-RECOGNIZERS ' recognize-word SWAP 1+ SET-RECOGNIZERS
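A second recognizer can be built along the same lines. The following is a minimal sketch, not taken from the original RfD, of a recognizer for unsigned single-cell numbers in the current BASE. The names number-recognized and recognize-number are hypothetical and merely follow the naming convention of XY.2; rejecting POSTPONE with throw code -48 is a simplification chosen for this sketch. RECOGNIZER and UNRECOGNIZED are repeated from XY.7 so that the fragment loads on its own.

```forth
\ Sketch only: a recognizer for unsigned single-cell numbers in the
\ current BASE. number-recognized and recognize-number are hypothetical.

\ Repeated from the reference implementation (XY.7).
: RECOGNIZER ( xt-interpret xt-compile xt-postpone -- rit )
  HERE >R SWAP ROT , , , R> ;
: notfound ( i*x -- )  -13 THROW ;
' notfound ' notfound ' notfound RECOGNIZER CONSTANT UNRECOGNIZED

:NONAME ( n -- n )  ;                   \ INTERPRET: leave n on the stack
:NONAME ( n -- )    POSTPONE LITERAL ;  \ COMPILE: compile n as a literal
:NONAME ( n -- )    -48 THROW ;         \ POSTPONE: rejected in this sketch
RECOGNIZER CONSTANT number-recognized

\ parsing word for the number recognizer
: recognize-number ( addr len -- n rit | UNRECOGNIZED )
  DUP 0= IF 2DROP UNRECOGNIZED EXIT THEN   \ empty string: not a number
  0 0 2SWAP >NUMBER                        ( ud addr' len' )
  NIP IF                                   \ unconverted characters remain
    2DROP UNRECOGNIZED
  ELSE
    DROP number-recognized                 \ keep the low cell of ud as n
  THEN ;
```

In DECIMAL, `S" 123" recognize-number` leaves 123 and number-recognized on the stack, while `S" 12x" recognize-number` leaves only UNRECOGNIZED. The recognizer could be added to the recognizer-order in the same way as the word recognizer, using `GET-RECOGNIZERS ' recognize-number SWAP 1+ SET-RECOGNIZERS`. Signs, double-cell numbers and overflow are deliberately ignored here.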
end of document