,---------------.
| Contributions |
`---------------´
,------------------------------------------
| 2020-07-20 20:36:30 BerndPaysan wrote:
| proposal - Recognizer
| see: https://forth-standard.org/proposals/recognizer#contribution-142
`------------------------------------------
# Forth Recognizer -- Request For Discussion
* Author: Matthias Trute
* Version: 4
* Date: 2 August 2018
* Status: Final (Committee Supported Proposal)
## Change history
1. 2014-10-03 Version 1 - initial version.
2. 2015-05-17 Version 2 - extend rationale, added ' and [']
3. 2015-12-01 Version 3 - separate use cases, minor changes for nested recognizer stacks. New `POSTPONE` action.
4. 2018-07-24 Version 4 - Clarifications, Fixing typos, added test cases
## Change history, details
1. 2016-09-18 Added more test cases
1. 2016-09-25 Clarify that `>IN` is unchanged for a `REC-FAIL` (`RECTYPE-NULL`)
result.
1. 2016-10-21 simpler reference implementation
1. 2016-11-05 first attempt to rename keywords and concept names
1. 2017-05-15 discussion of `LOCATE`
1. 2017-08-08 move example recognizers to discussion/rationale section.
1. 2017-09-12 renamed keywords in XY.6.1 as suggested by the Forth 200x committee
1. 2017-12-06 changed wording from "recognizer stack" to "recognizer sequence".
1. 2017-12-10 created Recognizer EXT section with recognizer sequence management words.
1. 2018-04-09 expanded EXT section with RECTYPE* words
1. 2018-05-11 add comments about `recognizable?`
1. 2018-07-23 finalized
1. 2018-07-24 small bugfixes
1. 2018-08-02 split document into proposal and comments
# Problem
The Forth compiler can be extended easily. The Forth
interpreter however has a fixed set of capabilities as
outlined in section 3.4 of the standard text: Words from
the dictionary and some number formats.
It's not possible to use the Forth text interpreter
in an application or system extension context. Most interpreters in
existing systems use a number of hooks to extend the interpreter.
That makes it possible to use a loadable library to
implement new data types to be handled like the built-in
ones. One example is floating point numbers. They
have their own parsing and data handling words including
a stack of their own.
Furthermore applications need to use system provided and system specific
words or have to re-invent the wheel to get numbers with a sign or
hex numbers with the $ prefix. The building blocks (`FIND`, `COMPILE,`,
`>NUMBER` etc) are available but there is a gap between them and what
the Forth interpreter already does.
To actually handle data in the Forth context, the
processing actions need to be `STATE` aware. It
would be nice if the Forth text interpreter,
which maintains `STATE`, were able to do the data
processing without exposing `STATE` to the data
handling methods. These different methods need to
be registered somehow.
# Solution
The monolithic design of the Forth interpreter is factored into
three major blocks: First the interpreter. It maintains `STATE`
and organizes the work. Second the actual data parsing. It is
called from the interpreter and analyses strings (sub-strings
of `SOURCE`) if they match the criteria for a certain data
type. These parsing words are grouped to achieve an
order of invocation. The result of the parsing words is handed
over to the interpreter with data specific handling methods.
There are three methods for each data type: two that are selected
by `STATE` (interpret and compile) and one to `POSTPONE` the data.
The combination of a parsing word and the set of data handling words
to deal with the data is called a recognizer. There is no strict 1:1
relation between the parsing words and the data handling sets. A data
handling set for e.g. single cell numbers can be used by different
parsing words.
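The sharing of one data handling set between several parsing words can be sketched as follows. This is only an illustration, not part of the proposal: it assumes `RECTYPE:` and `RECTYPE-NULL` behave as specified in the glossary below, and the names `RECTYPE-UNUM`, `PREFIXED-NUM`, `REC-HEX` and `REC-BIN` are hypothetical.

```forth
\ Assumes RECTYPE: and RECTYPE-NULL as defined in this proposal.
\ RECTYPE-UNUM, PREFIXED-NUM, REC-HEX and REC-BIN are hypothetical names.

\ One data handling set for unsigned single-cell numbers.
:NONAME ;                     \ interpret: the number stays on the stack
:NONAME POSTPONE LITERAL ;    \ compile: compile the number as a literal
DUP                           \ postpone: same action, compile as a literal
RECTYPE: RECTYPE-UNUM

\ Shared helper: check the one-character prefix, then convert in the given base.
: PREFIXED-NUM ( addr len prefix base -- u RECTYPE-UNUM | RECTYPE-NULL )
  2>R                                           \ R: prefix base
  DUP 2 < IF 2R> 2DROP 2DROP RECTYPE-NULL EXIT THEN
  OVER C@ 2R> >R <>                             \ R: base
  IF R> DROP 2DROP RECTYPE-NULL EXIT THEN
  1 /STRING                                     \ skip the prefix character
  BASE @ R> BASE ! >R                           \ switch BASE, R: old base
  0 0 2SWAP >NUMBER                             ( ud addr' len' )
  R> BASE !                                     \ restore BASE
  NIP IF 2DROP RECTYPE-NULL EXIT THEN           \ unconverted characters remain
  DROP RECTYPE-UNUM ;                           \ keep the low cell of ud

\ Two parsing words that return the same data type id.
: REC-HEX ( addr len -- u RECTYPE-UNUM | RECTYPE-NULL ) [CHAR] $ 16 PREFIXED-NUM ;
: REC-BIN ( addr len -- u RECTYPE-UNUM | RECTYPE-NULL ) [CHAR] % 2 PREFIXED-NUM ;
```

Both parsing words leave the input string and `>IN` untouched and restore `BASE`, so they can be placed in any recognizer sequence.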
Whenever the Forth text interpreter is mentioned, the standard
words `EVALUATE` (CORE), `'` (tick, CORE), `INCLUDE-FILE`
(FILE), `INCLUDED` (FILE), `LOAD` (BLOCK) and `THRU` (BLOCK)
are expected to act likewise. This proposal does not change
these words; it provides the tools to do so. As long as the
standard feature set is used, a complete replacement with
recognizers is possible.
This proposal is about the building blocks.
# Proposal
## XY. The optional Recognizer word set
### XY.1 Introduction
The recognizer concept consists of two elements: parsing words
and the data type information they return. The data type information
identifies the parsed data and provides the methods to perform the
various semantics of the data: interpret, compile and postpone. A parsing
word can return different data type information. A particular data type
information can be used by different parsing words.
A system provided data type information is called `RECTYPE-NULL`.
It is used if no other one is applicable. This token is
associated with the system error actions if used in step
e) of the text interpreter (see Appendix). It is used to
achieve the action d) of the section 3.4 text interpreter.
A recognizing word within the recognizer concept has the stack effect
```
REC-SOMETYPE ( addr len -- i*x RECTYPE-SOMETYPE | RECTYPE-NULL )
```
This recognizing word must not change the string. When it is called
from the interpreter, it may access `SOURCE` and, if applicable,
even change `>IN`. If `>IN` is not used, any string may serve
as input, otherwise "addr/len" is assumed to be a substring of the
buffer `SOURCE`.
"i*x" is the result of the recognizing action of the string "addr/len".
`RECTYPE-SOMETYPE` is the data type id that the interpreter uses
to execute the interpret, compile or postpone actions for the data `i*x`.
All three actions are called with the "i*x" data as left from the
recognizing word and are generally expected to consume it. They can
have additional stack effects, depending on what
`RECTYPE-SOMETYPE-METHOD` actually does.
```
RECTYPE-SOMETYPE-METHOD ( ... i*x -- j*y )
```
The data "i*x" doesn't have to be on the data stack, it
can be at different places, if applicable. E.g. floating
point numbers have a stack of their own. In this case,
the data stack contains the `RECTYPE-SOMETYPE` information only.
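As an illustration of a recognizer whose data "i*x" occupies two cells, the following sketch recognizes a token enclosed in double quotes. It is not part of the proposal: `RECTYPE-STR` and `REC-STR` are hypothetical names, `RECTYPE:` and `RECTYPE-NULL` are assumed to work as specified in the glossary below, and `/STRING` and `SLITERAL` come from the optional String word set.

```forth
\ Assumes RECTYPE: and RECTYPE-NULL as defined in this proposal.
\ RECTYPE-STR and REC-STR are hypothetical names.

:NONAME ;                      \ interpret: leave addr len on the stack
:NONAME POSTPONE SLITERAL ;    \ compile: compile the string as a literal
DUP                            \ postpone: same action
RECTYPE: RECTYPE-STR

: REC-STR ( addr len -- addr' len' RECTYPE-STR | RECTYPE-NULL )
  DUP 2 < IF 2DROP RECTYPE-NULL EXIT THEN             \ too short for two quotes
  OVER C@ [CHAR] " <> IF 2DROP RECTYPE-NULL EXIT THEN \ no opening quote
  2DUP + 1- C@ [CHAR] " <> IF 2DROP RECTYPE-NULL EXIT THEN \ no closing quote
  1 /STRING 1-                                        \ strip both quotes
  RECTYPE-STR ;
```

The string returned while interpreting points into the input that was passed in, so it is only valid as long as that buffer is. Because the sketch does not touch `>IN`, it only sees tokens without embedded spaces.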
### XY.2 Additional terms and notations
**Data type id**
A cell sized number. It identifies the data type and a method set to perform
the data processing in the text interpreter. The actual numeric value is
system specific.
**Recognizer**
A string parsing word that returns a data type id together
with the parsed data if successful. The string parsing
word is assumed to run within the Forth interpreter and
can access `SOURCE` and `>IN`.
**Recognizer Sequence**
An ordered set of recognizers. It is identified with
a cell sized numeric id.
### XY.3 Additional usage requirements
#### XY.3.1 Data type id
A data type id is a single cell value that
identifies a certain data type. Append
the following table to table 3.1:
| Symbol | Data type    | Size on Stack |
|--------|--------------|---------------|
| dt     | data type id | 1 cell        |
### XY.4 Additional documentation requirements ###
#### XY.4.1 System documentation ####
##### XY.4.1.1 Implementation-defined options #####
No additional options.
##### XY.4.1.2 Ambiguous conditions #####
* Change of the content of the parsed string during parsing.
#### XY.4.2 Program documentation ####
No additional dependencies.
### XY.5 Compliance and labeling ###
The phrase "Providing the Recognizer word set" shall be appended
to the label of any standard system that provides all of the
Recognizer word set.
### XY.6 Glossary ###
#### XY.6.1 Recognizer Words ####
**FORTH-RECOGNIZER** ( -- rec-seq-id ) RECOGNIZER \
A system `VALUE` that holds a recognizer sequence id.
It can be changed with `TO` to assign a
new recognizer sequence. This change has immediate effect.
This recognizer sequence shall be used in all
system level words like `EVALUATE`, `LOAD` etc.
**RECOGNIZE** ( addr len rec-seq-id -- i*x RECTYPE-DATATYPE | RECTYPE-NULL )
RECOGNIZER \
Apply the string at "addr/len" to the elements of the recognizer
sequence identified by `rec-seq-id`. Terminate the iteration if either
a parsing word returns a data type id that is different from
`RECTYPE-NULL` or the sequence is exhausted. In the latter case return
`RECTYPE-NULL`.
"i*x" is the result of the parsing word. It represents the data from
the string. It may be on other locations than the data stack. In this
case the stack diagram should be read accordingly.
**RECTYPE>COMP** ( RECTYPE-DATATYPE -- XT-COMPILE ) RECOGNIZER \
Return the execution token for the compilation action from the
recognizer data type id.
**RECTYPE>INT** ( RECTYPE-DATATYPE -- XT-INTERPRET ) RECOGNIZER \
Return the execution token for the interpretation action from
the recognizer data type id.
**RECTYPE>POST** ( RECTYPE-DATATYPE -- XT-POSTPONE ) RECOGNIZER \
Return the execution token for the postpone action from the
recognizer data type id.
**RECTYPE-NULL** ( -- RECTYPE-NULL ) RECOGNIZER \
The null data type id. It is to be used if no other
data type id is applicable but one is needed. Its
associated methods perform system specific error
actions. The actual numeric value is system dependent.
**RECTYPE:** ( XT-INTERPRET XT-COMPILE XT-POSTPONE "name" -- )
RECOGNIZER \
Skip leading space delimiters. Parse name delimited by a space. Create
a data type id under the name `name` and associate the three execution
tokens.
The words for XT-INTERPRET, XT-COMPILE and XT-POSTPONE are called with
the parsed data `i*x` that e.g. `RECOGNIZE` has returned.
The word behind XT-INTERPRET shall have the stack effect
`( ... i*x -- j*y )`. The words behind XT-COMPILE and XT-POSTPONE shall
consume `i*x`.
The execution time of `name` leaves a cell sized token on the data stack
that can be applied to the `RECTYPE>*` words.
#### XY.6.2 Recognizer Extension Words ####
A Forth system that uses recognizers in the core
has words for numbers and dictionary look-ups.
They shall be named as shown in the table:
| Name | Stack effect |
|------|--------------|
| `REC-NUM` | `( addr len -- n RECTYPE-NUM \| d RECTYPE-DNUM \| RECTYPE-NULL )` |
| `REC-FLOAT` | `( addr len -- RECTYPE-FLOAT \| RECTYPE-NULL ) (F: -- f \| )` |
| `REC-FIND` | `( addr len -- XT +/-1 RECTYPE-XT \| RECTYPE-NULL )` |
| `REC-NT` | `( addr len -- NT RECTYPE-NT \| RECTYPE-NULL )` |
The recognizer type names, if available, shall be as shown in the table below:
| Name | Stack items | Comment |
|------|-------------|---------|
| `RECTYPE-NUM` | `( -- n RECTYPE-NUM)` | single cell number |
| `RECTYPE-DNUM` | `( -- d RECTYPE-DNUM)` | double cell number |
| `RECTYPE-FLOAT` | `( -- RECTYPE-FLOAT)` `(F: -- f )` | floating point number |
| `RECTYPE-XT` | `( -- XT +/-1 RECTYPE-XT)` | word from the dictionary matching `FIND` |
| `RECTYPE-NT` | `( -- NT RECTYPE-NT)` | word from the dictionary with name token NT |
The following words deal with changing and creating recognizer sequences.
**GET-RECOGNIZER** ( rec-seq-id -- rec-n .. rec-1 n ) RECOGNIZER EXT \
Copy the recognizer sequence `rec-1 .. rec-n` to the data stack. The
element `rec-1` is the first in the sequence.
The source is unchanged.
**SET-RECOGNIZER** ( rec-n .. rec-1 n rec-seq-id -- ) RECOGNIZER EXT \
Replace the recognizer sequence identified by `rec-seq-id` with a
new set of `n` recognizers `rec-x`.
If the capacity of the destination sequence is too small to hold all
new elements, an ambiguous condition exists.
**NEW-RECOGNIZER-SEQUENCE** ( size -- rec-seq-id ) RECOGNIZER EXT \
Create a new, empty recognizer sequence with at least
`size` elements.
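A typical use of these words is to extend an existing sequence by one recognizer. The following is a minimal sketch, assuming `GET-RECOGNIZER` and `SET-RECOGNIZER` behave as specified above; the name `ADD-RECOGNIZER` is hypothetical. The new recognizer becomes `rec-1` of the sequence, and the caller is assumed to have checked that the sequence has room for one more element.

```forth
\ Assumes GET-RECOGNIZER and SET-RECOGNIZER as specified above.
\ ADD-RECOGNIZER is a hypothetical name.
: ADD-RECOGNIZER ( xt rec-seq-id -- )
  SWAP >R DUP >R          \ R: xt rec-seq-id
  GET-RECOGNIZER          ( rec-n .. rec-1 n )
  R> R>                   ( rec-n .. rec-1 n rec-seq-id xt )
  ROT ROT                 ( rec-n .. rec-1 xt n rec-seq-id )
  SWAP 1+ SWAP            ( rec-n .. rec-1 xt n+1 rec-seq-id )
  SET-RECOGNIZER ;
```

For example, `' REC-SOMETYPE FORTH-RECOGNIZER ADD-RECOGNIZER` would add a parsing word to the system sequence.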
### XY.7 Reference Implementation ###
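Basic recognizer sequence module. It is implemented as a separate
stack. The code uses `CELL` and `BOUNDS`, which are common but not
standard words; they can be defined as `1 CELLS CONSTANT CELL` and
`: BOUNDS OVER + SWAP ;`.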
```
: STACK ( size -- stack-id )
1+ ( size ) CELLS HERE SWAP ALLOT
0 OVER ! \ empty stack
;
: SET-STACK ( item-n .. item-1 n stack-id -- )
2DUP ! CELL+ SWAP CELLS BOUNDS
?DO I ! CELL +LOOP ;
: GET-STACK ( stack-id -- item-n .. item-1 n )
DUP @ >R R@ CELLS + R@ BEGIN
?DUP
WHILE
1- OVER @ ROT CELL - ROT
REPEAT
DROP R> ;
```
The recognizer sequence uses the stack module. Hence the stack-id becomes the
rec-seq-id.
```
: NEW-RECOGNIZER-SEQUENCE STACK ;
: SET-RECOGNIZER SET-STACK ;
: GET-RECOGNIZER GET-STACK ;
\ create the default recognizer sequence
4 NEW-RECOGNIZER-SEQUENCE VALUE FORTH-RECOGNIZER
\ create a simple 3 element structure
: RECTYPE: ( XT-INTERPRET XT-COMPILE XT-POSTPONE "name" -- )
CREATE SWAP ROT , , ,
;
\ decode the data structure created by RECTYPE:
: RECTYPE>POST ( RECTYPE-TOKEN -- XT-POSTPONE ) CELL+ CELL+ @ ;
: RECTYPE>COMP ( RECTYPE-TOKEN -- XT-COMPILE ) CELL+ @ ;
: RECTYPE>INT ( RECTYPE-TOKEN -- XT-INTERPRET) @ ;
\ the null token
:NONAME -1 ABORT" FAILED" ; DUP DUP RECTYPE: RECTYPE-NULL
\ depends on the stack implementation
: RECOGNIZE ( addr len rec-seq-id -- i*x RECTYPE-SOMETYPE | RECTYPE-NULL )
DUP >R @
BEGIN
DUP
WHILE
DUP CELLS R@ + @
2OVER 2>R SWAP 1- >R
EXECUTE DUP RECTYPE-NULL <> IF
2R> 2DROP 2R> 2DROP EXIT
THEN
DROP R> 2R> ROT
REPEAT
DROP 2DROP R> DROP RECTYPE-NULL
;
```
## A.XY Informal Appendix ##
### A.XY.1 Text Interpreter ###
The Forth text interpreter can be changed into a generic tool
that is capable of dealing with any data type. It maintains `STATE`
and calls the data processing methods according to it. The
example is a full replacement if all necessary recognizers are
available.
The algorithm of the Forth text interpreter as described in
section 3.4 is modified. All subsections of 3.4 apply
unchanged. Steps b) and c) of section 3.4 become
optional, since they can be performed with recognizers. Replace step
d) with the following steps d) to f):
1. For each element of the recognizer sequence provided by `FORTH-RECOGNIZER`,
starting with the top element, call its parsing method with the sub-string
"name" from step a).
Every parsing method returns an information token and the parsed data from
the analyzed sub-string if successful. Otherwise it returns the system
provided failure token `RECTYPE-NULL` and no further data.
Continue with the next element in the recognizer set until either all are
used or the information token returned from the parsing word is not the
system provided failure token `RECTYPE-NULL`.
2. Use the information token and do one of the following
1. if interpreting execute the interpret method associated with the
information token.
2. if compiling execute the compile method associated with the information
token.
3. Continue with a)
```
: INTERPRET
BEGIN
PARSE-NAME DUP
WHILE
FORTH-RECOGNIZER RECOGNIZE
STATE @ IF RECTYPE>COMP ELSE RECTYPE>INT THEN
EXECUTE
?STACK \ simple housekeeping
REPEAT 2DROP
;
```
### A.XY.2 POSTPONE ###
`POSTPONE` compiles the data returned by `RECOGNIZE` (`i*x`)
into the dictionary as literal(s) and appends the compilation action
of the `RECTYPE-TOKEN` data type id. Later at run-time the `i*x`
data is read back and the compilation action is performed as if it
had been called directly at compile time.
```
: POSTPONE ( "name" -- )
PARSE-NAME FORTH-RECOGNIZER RECOGNIZE DUP >R \ keep the data type id
RECTYPE>POST EXECUTE \ compile i*x as literal(s)
R> RECTYPE>COMP COMPILE, \ append the compilation action
; IMMEDIATE
```
This implementation assumes a system that uses recognizers only.
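The effect of the postpone action can be tried without replacing the system `POSTPONE`. The sketch below assumes the reference implementation from XY.7 is loaded and that the system accepts compiling a literal from within `[ ... ]`; `RECTYPE-DEMO`, `REC-DEMO`, `DEMO-RECS`, `DEMO-POSTPONE`, `LIT-42,` and `ANSWER` are hypothetical names, and `REC-DEMO` converts in the current `BASE` (decimal is assumed).

```forth
\ Assumes RECTYPE:, RECTYPE>POST, RECTYPE>COMP, RECTYPE-NULL, RECOGNIZE,
\ NEW-RECOGNIZER-SEQUENCE and SET-RECOGNIZER from XY.7 are loaded.
\ All other names below are hypothetical.

:NONAME ;                     \ interpret: number stays on the stack
:NONAME POSTPONE LITERAL ;    \ compile: compile the number as a literal
DUP                           \ postpone: compile the number as a literal
RECTYPE: RECTYPE-DEMO

: REC-DEMO ( addr len -- u RECTYPE-DEMO | RECTYPE-NULL )
  0 0 2SWAP >NUMBER                      ( ud addr' len' )
  NIP IF 2DROP RECTYPE-NULL EXIT THEN    \ unconverted characters remain
  DROP RECTYPE-DEMO ;

\ a private sequence with a single recognizer
1 NEW-RECOGNIZER-SEQUENCE CONSTANT DEMO-RECS
' REC-DEMO 1 DEMO-RECS SET-RECOGNIZER

\ same body as POSTPONE above, but bound to the private sequence
: DEMO-POSTPONE ( "name" -- )
  PARSE-NAME DEMO-RECS RECOGNIZE DUP >R
  RECTYPE>POST EXECUTE R> RECTYPE>COMP COMPILE, ; IMMEDIATE

: LIT-42, ( -- ) DEMO-POSTPONE 42 ;   \ when run, compiles the literal 42
: ANSWER ( -- n ) [ LIT-42, ] ;       \ LIT-42, runs while ANSWER is compiled
```

While `LIT-42,` is being compiled, the postpone action stores 42 as a literal and `COMPILE,` appends the compile action. Running `LIT-42,` inside the definition of `ANSWER` then pushes 42 and performs the compile action, so `ANSWER` contains the literal.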
### A.XY.3 Test Cases ###
The test cases assume a stack to implement the recognizer set.
```
T{ 4 NEW-RECOGNIZER-SEQUENCE constant RS -> }T
T{ :NONAME 1 ; :NONAME 2 ; :NONAME 3 ; RECTYPE: rectype-1 -> }T
T{ :NONAME 10 ; :NONAME 20 ; :NONAME 30 ; RECTYPE: rectype-2 -> }T
T{ : rec-1 NIP 1 = IF rectype-1 ELSE RECTYPE-NULL THEN ; -> }T
T{ : rec-2 NIP 2 = IF rectype-2 ELSE RECTYPE-NULL THEN ; -> }T
T{ rectype-1 RECTYPE>INT EXECUTE -> 1 }T
T{ rectype-1 RECTYPE>COMP EXECUTE -> 2 }T
T{ rectype-1 RECTYPE>POST EXECUTE -> 3 }T
\ testing RECOGNIZE
T{ 0 RS SET-RECOGNIZER -> }T
T{ S" 1" RS RECOGNIZE -> RECTYPE-NULL }T
T{ ' rec-1 1 RS SET-STACK -> }T
T{ S" 1" RS RECOGNIZE -> rectype-1 }T
T{ S" 10" RS RECOGNIZE -> RECTYPE-NULL }T
T{ ' rec-2 ' rec-1 2 RS SET-STACK -> }T
T{ S" 10" RS RECOGNIZE -> rectype-2 }T
```
The dictionary lookup has the following test cases
```
T{ S" DUP" REC-FIND -> ' DUP -1 RECTYPE-XT }T
T{ S" UNKNOWN WORD" REC-FIND -> RECTYPE-NULL }T
```
The number recognizer has the following checks
```
VARIABLE OLD-BASE BASE @ OLD-BASE !
T{ : S-1234 S" 1234" ; -> }T
T{ : D-1234 S" 1234." ; -> }T
T{ : S-UNKNOWN S" unknown word" ; -> }T
T{ : S-DUP S" DUP" ; -> }T
T{ S-1234 FORTH-RECOGNIZER RECOGNIZE -> 1234 RECTYPE-NUM }T
T{ D-1234 FORTH-RECOGNIZER RECOGNIZE -> 1234. RECTYPE-DNUM }T
T{ S-DUP FORTH-RECOGNIZER RECOGNIZE -> ' DUP -1 RECTYPE-XT }T
T{ S-UNKNOWN FORTH-RECOGNIZER RECOGNIZE -> RECTYPE-NULL }T
T{ S" %-10010110" REC-NUM -> -150 RECTYPE-NUM }T
T{ S" %10010110" REC-NUM -> 150 RECTYPE-NUM }T
T{ S" 'Z'" REC-NUM -> char Z RECTYPE-NUM }T
T{ S" ABCXYZ" REC-NUM -> RECTYPE-NULL }T
\ check whether BASE is unchanged
T{ BASE @ OLD-BASE @ = -> -1 }T
```
Floating point numbers are handled likewise
```
T{ : S-1234e5 S" 1234e5" ; -> }T
T{ S-1234e5 REC-FLOAT -> 1234e5 RECTYPE-FLOAT }T
T{ S-1234e5 FORTH-RECOGNIZER RECOGNIZE -> 1234e5 RECTYPE-FLOAT }T
```
# Experience #
First ideas to dynamically extend the Forth text interpreter
were published in 2005 at comp.lang.forth by Josh Fuller and J Thomas:
[Additional Recognizers](http://compgroups.net/comp.lang.forth/additional-recognizers/734676).
A specific solution to deal with number prefixes was
roughly sketched by Anton Ertl at comp.lang.forth in 2007 with
[https://groups.google.com/forum/#!msg/comp.lang.forth/r7Vp3w1xNus/Wre1BaKeCvcJ](https://groups.google.com/forum/#!msg/comp.lang.forth/r7Vp3w1xNus/Wre1BaKeCvcJ)
There are a number of specific solutions that can at least partly be seen
as recognizers in various Forths:
* prefix-detection in ciforth
* W32Forth uses its "chain" concept to achieve similar effects.
* various commercial Forths seem to have ways to extend the
interpreter.
* FICL, a system close to Forth, has
[parse-steps](http://ficl.sourceforge.net/parsesteps.html) since approx
2001.
A first generic recognizer concept was implemented in amforth
version 4.3 (May 2011). The design presented in this RFD is
implemented with version 5.3 (May 2014). gforth has
recognizers since 2012, the ones described here since June
2014.
Existing recognizers cover a wide range of data formats
like floating point numbers and strings. Others mimic the
back-tick syntax used in many Unix shells to execute OS
sub-processes. A recognizer is used to implement OO
notations.
Most of the small words that constitute a recognizer don't
need a name actually since only their execution tokens are
used. For the major words a naming convention is suggested:
`REC-name` for the parsing word and `RECTYPE-name`
for the data type word created with `RECTYPE:`, where "name" is the
data type name.
# Acknowledgments #
The following people made major or minor contributions, in
no particular order.
* Bernd Paysan
* Jenny Brien
* Andrew Haley
* Alex McDonald
* Anton Ertl
* Forth 200x Committee
,---------.
| Replies |
`---------´
,------------------------------------------
| 2020-07-03 08:35:50 tolich replies:
| referenceImplementation - Case-sensitivity independent implementation
| see: https://forth-standard.org/standard/tools/BracketELSE#reply-384
`------------------------------------------
I LOVE this implementation not because of its case sensitivity awareness.
You can define any parsing words in addition to [IF] [THEN] [ELSE] thus to avoid them in commentaries, string literals, etc.
: S" '"' PARSE 2DROP ; IMMEDIATE
: .( ')' PARSE 2DROP ; IMMEDIATE
You got the idea. It must be in the standard.
,------------------------------------------
| 2020-07-20 20:39:45 BerndPaysan replies:
| proposal - Recognizer
| see: https://forth-standard.org/proposals/recognizer#reply-385
`------------------------------------------
There are a number of discussions going on elsewhere, and I want to make sure the discussion is going to be here. So for a start, I just took Matthias' version D proposal.
Ideas discussed:
* Bikeshedding the names again (we already did that).
* Revert the postpone action to version A
* Introduce `RECOGNIZER-SEQUENCE:` word, which creates a new recognizer that actually consists of a sequence of those.
Please feel free to introduce these ideas here.