Proposal: Recognizer RfD rephrase 2020

Retired

This proposal has been moved into this section. Its former address was: /standard/intro

UlrichHoffmann [131] Recognizer RfD rephrase 2020 (Proposal, 2020-02-24 09:57:56)

Recognizer RfD rephrase 2020

Author: Ulrich Hoffmann
Contact: uho@xlerb.de
Version: 0.8 Date: 2020-02-24 Status: Published

Preamble

This text is a rephrasing of just section XY.2, XY.6, section XY.7 and parts of A.XY of the original recognizer RfD [1] by Matthias Trute that uses terminology and word names closer to that already present in Forth-94 and Forth-2012.

It is not intended to invalidate the subsequent RfDs B, C or D [2][3][4]. They reflect the ongoing discussion about Forth recognizers and should be considered valuable documentation of that discussion. This text however is intended to revert the recognizer proposal back to simplicity of concepts and terms, making it both easier to understand and use as well as simpler to implement.

This text does not add any new functionality to the original proposal. It merely introduces different terms for the structures already existing in the original proposal. The only difference in functionality is the substitution of the defining word RECOGNIZER: of the original proposal by the word RECOGNIZER (note the missing : ) that - similar to the Forth-94 word WORDLIST - creates a recognizer information token and leaves it on the data stack.

Yes - this text has the potential of starting a bikeshedding discussion but as the recognizer concepts seem to be stable over the last couple of years it is about time to agree on appropriate names and notions.

The following table summarizes the different terms and names:

Term in original proposal    Term used here                      Comment
recognizer stack             recognizer-order                    similar to search-order
information token (rit)      recognizer information token (rit)  explicit and consistent
DO-RECOGNIZER                RECOGNIZE                           avoid hyphen in name
RECOGNIZER:                  RECOGNIZER                          similar to WORDLIST, no defining word
R:FAIL                       UNRECOGNIZED                        no : in name, better English
REC:xxx                      recognize-xxx                       no : in name, better English
R:xxx                        xxx-recognized                      no : in name, better English

Items to discuss

  1. Programs that use the word RECOGNIZE (e.g. user-defined text interpreters) most likely need to use the interpret/compile/postpone xts of the returned recognizer information token. For these programs to be portable among standard systems appropriate access words would need to be standardized. [3] and [4] propose such words. Without these access words standardizing the word RECOGNIZE is doubtful. Only standardizing the modified (internal) text interpreter behavior would be sufficient then.

  2. The word RECOGNIZER (and the corresponding defining word RECOGNIZER: of [1]) creates the opaque structure recognizer information token. As an alternative, recognizer information tokens could be defined - similar to addresses of counted strings (c-addr) - as special addresses, and the structure of memory at that address could be exposed. Recognizer information tokens could then be created by already existing standard words such as CREATE ALLOT ALLOCATE and would have a known layout, e.g. three xts in sequence: { INTERPRET-XT | COMPILE-XT | POSTPONE-XT }. The access words of 1. would not need to be standardized, as each standard program could access the xts using already existing standard words for memory access.

  3. The word RECOGNIZER (and the corresponding defining word RECOGNIZER: of [1]) despite its name does not create a recognizer (i.e. a parsing-word plus possibly several recognizer information tokens) but a single recognizer information token (triple of interpret/compile/postpone xts characterized by a single-cell value). Another name might reflect this functionality better.

  4. Changes in the standard text interpreter (i.e. that it invokes the word RECOGNIZE internally) have implications for many other words apart from MARKER (e.g. ' ['] EVALUATE INCLUDE-FILE INCLUDED ...). Changes in their behaviour should be mentioned in the proposal. [2] proposes explicit changes for ' ['] MARKER while [3] and [4] have a paragraph describing the implications generally and do not propose e.g. MARKER changes explicitly.

  5. Recognizer information tokens (triple of interpret/compile/postpone xts characterized by a single-cell value) could be named more appropriately. [4] proposes a different name data type id that does not seem to be appropriate. Its general notion seems to mislead into the direction of Forth having a data type system.
    From a classical computer science view, recognizers act in the lexical analysis (scanner) phase of a compiler, operating on sequences of characters, detecting appropriate lexemes (character subsequences of the input stream) and converting them to tokens. Several lexemes might map to the same token (e.g. different sequences of digits map to the token NUM) along with so-called attributes (e.g. the value of the number). For this reason tokens are sometimes also called token classes or token types or the kind of the token. These might be good alternative names instead of recognizer information token or data type id. Forth-94 and Forth-2012 use the term ID (as in wordlist-id or file-id) to define characterizing single-cell values, so following the xxx-id pattern would be consistent with existing standard terms (maybe recognizer-token-id?).
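
The alternative described in item 2 could look roughly like the sketch below. It assumes that a recognizer information token is simply the address of three consecutive cells holding { INTERPRET-XT | COMPILE-XT | POSTPONE-XT }. The accessor names rit>interpret, rit>compile and rit>postpone, as well as the demo words, are hypothetical and do not appear in any of the RfDs.

```forth
\ Hypothetical accessors for a transparent three-cell token layout.
: rit>interpret ( rit -- xt )  @ ;
: rit>compile   ( rit -- xt )  CELL+ @ ;
: rit>postpone  ( rit -- xt )  2 CELLS + @ ;

\ Placeholder actions that only identify themselves (illustration only).
: act-interpret ( -- 1 )  1 ;
: act-compile   ( -- 2 )  2 ;
: act-postpone  ( -- 3 )  3 ;

\ A token built with existing standard words instead of RECOGNIZER.
CREATE demo-rit  ' act-interpret ,  ' act-compile ,  ' act-postpone ,

demo-rit rit>compile EXECUTE .   \ prints 2
```

With such a layout a program needs nothing beyond @ and CELL+ to reach the xts, at the price of freezing the memory layout for every implementation.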

References

[1] Forth Recognizer -- Request For Discussion, Version 1, Matthias Trute, 2014-10-03, access at http://amforth.sourceforge.net/pr/Recognizer-rfc.pdf

[2] Forth Recognizer -- Request For Discussion, Version 2, Matthias Trute, 2015-09-20, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-B.pdf

[3] Forth Recognizer -- Request For Discussion, Version 3, Matthias Trute, 2016-09-04, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-C.pdf

[4] Forth Recognizer -- Request For Discussion, Version 4, Matthias Trute, 2018-08-02, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-D.pdf


Proposal

....

XY.2 Additional terms and notations

Recognizer Information Token: An implementation-dependent single-cell value that identifies the data type and a method table to perform the data processing of the interpreter. A naming convention suggests that the names end with -recognized. Recognizer Information Tokens are abbreviated rit in stack comments.

Recognizer: A combination of a text parsing word that returns recognizer information tokens together with parsed data if successful. The text parsing word is assumed to run in cooperation with SOURCE and >IN. A naming convention suggests that the names start with recognize-.

...

XY.6 Glossary

XY.6.1 Recognizer words

RECOGNIZE ( addr len -- i*x rit | UNRECOGNIZED ) RECOGNIZER

Apply the recognizers in the recognizer-order to the string at "addr/len" one after the other. Terminate the iteration if either a recognizer returns a recognizer information token rit that is different from UNRECOGNIZED or the recognizer-order is exhausted. In the latter case return UNRECOGNIZED, otherwise return rit.

"i*x" is the result of the parsing word. It may be on other locations than the data stack. In this case the stack diagram should be read accordingly.

It is an ambiguous condition if the recognizer-order is empty.


GET-RECOGNIZERS ( -- rec-n .. rec-1 n ) RECOGNIZER

Return the execution tokens rec-1 .. rec-n of the parsing words in the recognizer-order. rec-1 identifies the recognizer that is called first and rec-n the execution token of the word that is called last.

The recognizer-order is unaffected.


MARKER ( "<spaces>name" -- ) RECOGNIZER

Extend MARKER to include the current recognizer-order in the state preservation.


UNRECOGNIZED ( -- UNRECOGNIZED ) RECOGNIZER

A constant cell sized recognizer information token with two uses: first it is used to deliver the information that a specific recognizer could not deal with the string passed to it. Second it is a predefined recognizer information token whose elements are used when no recognizer from the recognizer-order could handle the passed string. These methods provide the system error actions.

The actual numeric value is system dependent and has no predictable value.


RECOGNIZER ( XT-INTERPRET XT-COMPILE XT-POSTPONE -- rit ) RECOGNIZER

Create a recognizer information token rit with the three execution tokens XT-INTERPRET XT-COMPILE XT-POSTPONE. The implementation is system dependent.

The words for XT-INTERPRET, XT-COMPILE and XT-POSTPONE are called with the parsed data that the associated parsing word of the recognizer returned. The information token itself is consumed by the interpreter.


SET-RECOGNIZERS ( rec-n .. rec-1 n -- ) RECOGNIZER

Set the recognizer-order to the recognizers identified by the execution tokens of their parsing words rec-n .. rec-1. rec-1 will be the parsing word of the recognizer that is called first, rec-n will be the last one.

It is an ambiguous condition, if n is not a positive number.

XY.7 Reference Implementation

\ create a simple 3 element structure
\ rit           : XT-INTERPRET
\ rit CELL+     : XT-COMPILE
\ rit 2 CELLS + : XT-POSTPONE
: RECOGNIZER ( XT-INTERPRET XT-COMPILE XT-POSTPONE -- rit )
    HERE >R SWAP ROT , , , R> ;
    
\ system failure recognizer
: notfound ( i*x -- )  -13 THROW ;

' notfound  ' notfound  ' notfound RECOGNIZER CONSTANT UNRECOGNIZED

\ contains the recognizer-order
\ first cell is the current number of recognizers.
10 CELLS BUFFER: recognizer-order
0 recognizer-order !

: SET-RECOGNIZERS ( rec-n .. rec-1 n -- )
    DUP recognizer-order !
    BEGIN
      DUP
    WHILE
      DUP CELLS recognizer-order +
      ROT SWAP ! 1-
    REPEAT DROP 
;

: GET-RECOGNIZERS ( -- rec-n .. rec-1 n )
    recognizer-order @ recognizer-order
    BEGIN
      CELL+ OVER
    WHILE
      DUP @ ROT 1- ROT
    REPEAT 2DROP
    recognizer-order @
;

: RECOGNIZE ( addr len -- i*x rit | UNRECOGNIZED )
    recognizer-order @
    BEGIN
      DUP
    WHILE
      DUP CELLS recognizer-order + @
      2OVER 2>R SWAP 1- >R
      EXECUTE DUP UNRECOGNIZED <> IF R> DROP 2R> 2DROP EXIT THEN DROP
      R> 2R> ROT
    REPEAT
    DROP 2DROP
    UNRECOGNIZED
;

POSTPONE

POSTPONE is outside the Forth interpreter:

: POSTPONE ( "<spaces>name" -- )
   BL WORD COUNT
   RECOGNIZE
   2 CELLS + @ ( post ) \ get the XT-POSTPONE from recognizer
   EXECUTE
; IMMEDIATE

...

A.XY Informal Annex

A.XY.1 Forth Text Interpreter

The Forth text interpreter turns into a generic tool that is capable of dealing with any data type. It maintains STATE and calls the data processing methods according to it.

INTERPRETER
: PARSE-NAME ( -- addr u ) BL WORD COUNT ;

: INTERPRET ( i*x -- j*x )
    BEGIN
      PARSE-NAME ?DUP 0= IF DROP EXIT THEN \ no more words?
      RECOGNIZE
      STATE @ IF  CELL+ @  ( comp ) ELSE @ ( interp ) THEN \ get the right XT
      EXECUTE \ do the action
      ?STACK \ simple housekeeping
    AGAIN 
;

A.XY.2 Example Recognizers

Word recognizer
\ find-name is close to FIND. amforth specific.
256 BUFFER: find-name-buf

: place ( c-addr1 u c-addr2 -- )
   2DUP C! CHAR+ SWAP MOVE ;

: find-name ( addr len -- xt +/-1 | 0 )
   find-name-buf place
   find-name-buf
   FIND DUP 0= IF NIP THEN ;

: immediate? ( flags -- true|false ) 0> ;
    
\ Define word recognizer

\ INTERPRET
:NONAME ( i*x XT flags -- j*y )
  DROP EXECUTE ;

\ COMPILE
:NONAME ( XT flags -- )
  immediate?
  IF EXECUTE ELSE COMPILE, THEN ; \ immediate words run now, others are compiled

\ POSTPONE
:NONAME ( XT flags -- )
  immediate?
  IF COMPILE, ELSE POSTPONE LITERAL POSTPONE COMPILE, THEN ;

RECOGNIZER CONSTANT word-recognized

\ parsing word for word recognizer
: recognize-word ( addr len -- XT flags rit | UNRECOGNIZED )
   find-name ( addr len -- XT flags | 0 )
   ?DUP IF word-recognized ELSE UNRECOGNIZED THEN ;

\ prepend the word recognizer to the recognizer-order
GET-RECOGNIZERS ' recognize-word SWAP 1+ SET-RECOGNIZERS

end of document

ruv

I think this work on rephrasing, improving the terminology and wording, and even reforming the concepts is very important.

My thoughts regarding the terminology are the following.

3. The word "RECOGNIZER".

The most glaring conflict in the terminology lies between the words RECOGNIZER and GET-RECOGNIZERS. The former returns a rit, the latter returns recs. Such a conflict is inadmissible in the Standard. Another term (and another name for the word) should be found instead of "recognizer".

Anyway, keeping this word semantically similar to the WORDLIST word looks like a good choice (if any).

5. A name for triple of interpret/compile/postpone xts

This item is closely connected to the above one (3).

It should be taken into account that we already have execution token xt (that identifies execution semantics) and name token nt (that identifies a named definition). I.e., the specification applies the term "token" to an attribute itself (a value only), without the corresponding information about its type (or class, or kind). Hence, the information about the corresponding type should be called "token type" (an identifier of a token type).

Under the hood, this token type identifier should be associated with handlers: how to execute (interpret) the corresponding token, how to compile the corresponding token, etc.

See also my approach in comp.lang.forth post in 2018 (news:pngvcc$pta$1@gioia.aioe.org, copy).

4. Changes in the specifications of other words

Yes, the specification for MARKER should be updated.

But there's no need to mention changes in the behavior of the words that:

ruv

The items (1) and (2) are more about API than about terminology.

2. rit structure accessors

I think that suggesting programs use N CELLS + @ to access an xt is a bad choice. Even for the mentioned counted strings we have the COUNT word, i.e. an accessor. Therefore, even for a transparent structure, if we provide access, we should provide access words. (But I think we don't need access to the xts, see 1.ii below.)

1. interpret/compile/postpone xts

"Programs [...] most likely need to use the interpret/compile/postpone xts of the returned recognizer information token. [...] Without these access words standardizing the word RECOGNIZE is doubtful."

We have the following issues with this:

i. Nobody provides a use-case

Nobody provides a use-case for the scenario when a program needs access to these xts.

OTOH we have one important principle: a user-defined text interpreter should be implementable in a standard way (in this case, without re-implementing recognizers). NB: the POSTPONE-action is not needed for this text interpreter.

ii. Better to avoid accessors

Perhaps another way is better. Instead of using the corresponding xts, use the words that do the corresponding actions:

INTERPRET-TOKEN ( i*x token{k*x} token-type -- j*x )
COMPILE-TOKEN ( i*x token{k*x} token-type -- j*x )
POSTPONE-TOKEN ( token{k*x} token-type -- )

I.e., in place of

( ... rit ) _R>COMP EXECUTE

you just do

( ... rit ) COMPILE-TOKEN

Rationale: in most cases a user/program just needs to perform these actions. Getting an xt and then executing it adds an extra step without any benefit.
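
A minimal sketch of how these three words could be layered on top of the reference implementation's token layout. It assumes that the token-type is the address of three consecutive cells { INTERPRET-XT | COMPILE-XT | POSTPONE-XT } as in XY.7; other systems may represent it differently. TRANSLATE-TOKEN, num-token and the num-... actions are hypothetical names used only for illustration.

```forth
\ Action words in place of xt accessors (three-cell layout assumed).
: INTERPRET-TOKEN ( i*x token{k*x} token-type -- j*x )  @ EXECUTE ;
: COMPILE-TOKEN   ( i*x token{k*x} token-type -- j*x )  CELL+ @ EXECUTE ;
: POSTPONE-TOKEN  ( token{k*x} token-type -- )          2 CELLS + @ EXECUTE ;

\ Hypothetical helper: one step of a user-defined text interpreter.
: TRANSLATE-TOKEN ( i*x token{k*x} token-type -- j*x )
    STATE @ IF COMPILE-TOKEN ELSE INTERPRET-TOKEN THEN ;

\ Demo token type for single-cell numbers.
: num-interpret ( x -- x )  ;                  \ the number is already on the stack
: num-compile   ( x -- )    POSTPONE LITERAL ; \ compile it as a literal
: num-postpone  ( x -- )    TRUE ABORT" postponing is not covered by this sketch" ;
CREATE num-token  ' num-interpret ,  ' num-compile ,  ' num-postpone ,

42 num-token INTERPRET-TOKEN .   \ prints 42
```

A user-defined text interpreter written against these words never sees the xts, so the token layout can stay implementation-defined.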


Other issues regarding the API and examples.

6. Naming in A.XY.2

find-name was proposed to be a standard word that returns a name token nt.

It is better to use another name for your word ( addr len -- xt +/-1 | 0 ).

I can suggest

find-word-in ( c-addr u  wid -- c-addr u 0 | xt immediate-flag true )
find-word ( c-addr u -- c-addr u 0 | xt immediate-flag true )

Rationale: 1) the FIND word returns c-addr on fail 2) when implementing a text interpreter, on fail you need ( c-addr u ) to convert it into the number; 3) in some cases, always the same number of the result items is an advantage for optimization.
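
One possible way to get the suggested find-word stack effect on top of the standard FIND is sketched below. It assumes lexemes of at most 255 characters and a system that provides BUFFER:. The buffer name find-word-buf is hypothetical, and find-word-in is not covered.

```forth
\ Sketch only: find-word built on FIND via a counted-string copy.
256 CHARS BUFFER: find-word-buf          \ hypothetical scratch buffer

: find-word ( c-addr u -- c-addr u 0 | xt immediate-flag true )
    2DUP find-word-buf 2DUP C!           \ store the count
    CHAR+ SWAP MOVE                      \ copy the characters behind it
    find-word-buf FIND                   ( c-addr u c-addr' 0 | c-addr u xt +/-1 )
    DUP 0= IF NIP EXIT THEN              \ not found: keep the original string
    2SWAP 2DROP                          ( xt +/-1 )
    0> TRUE ;                            \ FIND returns 1 for immediate words

\ S" DUP" find-word            ( -- xt immediate-flag true )
\ S" no-such-word" find-word   ( -- c-addr u 0 )
```

On failure the original ( c-addr u ) is still on the stack, so the caller can hand it straight to a number conversion.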

7. Action of postponing isn't essential

i. Why does a user (a program) need to use the POSTPONE-action?

The only known use case is to implement the ]] ... [[ construct. But this construct, when implemented via the POSTPONE-action, has a set of known flaws: it doesn't follow the copy-pastability design principle, and (as a result) it doesn't handle immediate parsing words in a convenient way (including comments, and [IF] ... [THEN]).

Actually, a user doesn't need to use the POSTPONE-action; the user just needs to postpone fragments of code! Therefore, it is better to provide something like a c{ ... }c construct (see my c-state PoC) that provides full copy-pastability.

ii. Why does a user (a program) have to specify POSTPONE-action for a new token-type?

The only reason is to make a Forth system aware of how to apply POSTPONE to a user-defined literal. But a user doesn't need to apply POSTPONE to the literals if a Forth system provides a way to postpone any fragments of code.

Well, perhaps a Forth system cannot postpone a user-defined literal (or even a parsing word) as part of a fragment of code if a user doesn't provide a POSTPONE-action? But that is wrong. Since any COMPILE-action is defined via standard words (and words defined via standard words), a Forth system is able to postpone any token for which a COMPILE-action is defined (see the same c-state PoC).

Yes, it is not easy to implement, but it is very convenient to use!

ruv

Another open question

8. Dependency on STATE

Obviously, the results of RECOGNIZE may depend on search order and BASE. Also, a user-defined recognizer may depend on user-defined states.

But what about STATE-dependency for initial recognizer? May the results of RECOGNIZE depend on STATE?

E.g. recognize-word from the example in A.XY.2 is based on FIND, and hence it may depend on STATE. Hence, in some cases the result must not be performed in a different STATE (see also my proposal for FIND clarification).

AntonErtl

Yes - this text has the potential of starting a bikeshedding discussion but as the recognizer concepts seem to be stable over the last couple of years it is about time to agree on appropriate names and notions.

It seems to me that the seeming stability is treacherous. Stephen Pelc wants to go back to an earlier version of the proposal, and Alex McDonald wants to revise it. We have had renamings already, and as a result, it is not as easy to compare versions as it could be.

However, we have also had a new concept that got the name of an old concept that it replaced: the postpone action was replaced by a time-shifting action (from which the postpone action can be built), but the time-shifting action was still called "postpone action". I am not sure if changing the name now would be helpful, or if it's too late.

StephenPelc

I apologise for my delay in responding to Ulli's document.

Overall I think that it's a really good first step, and Ruv's comments are also good.

I do not want to go back to an older proposal in particular, I want a proposal that ordinary mortals can understand. I just want clarity.

Naming

I don't much care whether the recogniser triple is called a rid or a rit. The more neutral term seems to be rid, but rit is now in use, so let's keep to it. Either can be pronounced clearly in discussions.

In normal Forth usage the word that lays the implementation-dependent data would be called RECOGNIZER, (with a comma). It should return a rit. What's the point of having an identifier if we don't use it? If we use this terminology, then the obvious way to refer to return values is RIT-NUM, RIT-FLOAT and so on.

Accessors

Do these ever get used outside the internals of other words? If not, a standard team has no business prescribing these. How many other implementations of a rit exist (in the wild) apart from the xt triple? Ruv's point about a use-case is well taken here.

The POSTPONE action xt is needed for two reasons:

  1. POSTPONE needs it
  2. Not all parsers are for literals, e.g. OOP parsers.

We cannot predict how recognisers will be used, so attempting to automate the POSTPONE actions is doomed to failure. OOP is not a hand-waving prediction, VFX's CIAO and ClassVFX packages both use recognisers.

STATE dependency

Having RECOGNISE be dependent on STATE is horrible.

ruv

Regarding time-shifting action

I am not sure if changing the name now would be helpful, or if it's too late.

I think correcting the name (and the corresponding terminology) will be helpful, since we should not call something a "postpone action" if it isn't actually a "postpone action" (well, perhaps it is "postponing" in some sense, but not in the sense of the POSTPONE word, so it is confusing). Moreover, in discussions we still need to refer to both concepts: the full postpone action and the "time-shifting" action.

OTOH, it seems "time-shifting" is a sub-optimal term.

What should the corresponding Forth definition do? It should take the token from the stacks (the data stack, floating-point stack, or something else) and compile code that, when executed, will place the token on the stacks. In other words, it compiles the token as numbers (i.e. literally). So it is distinct from the "compile action" that performs the compilation semantics for the source lexeme.

I suggest the term reproducing. So we can have: interpreting action, reproducing action, compiling action, postponing action (if any).

to reproduce a token: to take the token and compile code that when executed will place the token on the stacks
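
For single-cell and double-cell tokens, a reproducing action in this sense could be as small as the following sketch. The names reproduce-x, reproduce-xd, forty-two, and demo are hypothetical; 2LITERAL assumes the Double-Number word set is present.

```forth
\ Reproducing actions: take the token from the data stack and compile
\ code that will place it back on the stack when executed.
: reproduce-x  ( x -- )   POSTPONE LITERAL ;
: reproduce-xd ( xd -- )  POSTPONE 2LITERAL ;

\ Usage: an immediate word that reproduces the token 42 into the
\ definition currently being compiled.
: forty-two, ( -- )  42 reproduce-x ; IMMEDIATE
: demo ( -- 42 )  forty-two, ;

demo .   \ prints 42
```

A compiling action, by contrast, performs the compilation semantics of the lexeme itself, which for an ordinary word means appending its execution rather than a literal.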

ruv

Solution for items 3 and 5.

I also have found a solution for a "RECOGNIZER" mess and a "triple" of xts.

I suggest the term token descriptor (or just descriptor).

token descriptor: an implementation dependent data object (a set of information) that describes translating a token.

Also I have one more idea to discuss how to avoid providing all these (one, or two, or three, or even four) actions when you create a token descriptor.

So far so good.

Further development

I created SpfDev team in ForthHub, and fep-recognizer private repository to design the specification for Recognizer (fep from Forth Enhancement Proposal, after PEP). I included some people that I have found in ForthHub and who made proposals here. Write to me here or in a distinct issue if you are interested to be included.

I published a draft for terms definitions and data types, and an issue for feedback. I created this draft since I see too many issues in the current proposals from the formal point of view, and no answers. See also news:rb43tl$elj$1@dont-email.me (copy). Now I'm looking forward to your thoughts.

Please let me know if you think it is worth making the fep-recognizer repository public (since Gerald Wodni said "normally proposals are developed in smaller groups and only presented to the public once they are pretty solid"). I hope the GitHub tools help us to better organize collaboration on this work.

StephenPelc

Naming of return values

There's a proposal that we should standardise the values returned by RECOGNIZE (RECTYPE-xxx, rit-xxx ...). After a while debugging two new FP packages on the same host, and even loading one after the other, I believe that this proposal is doomed to fail.

If two float packs return the same value, they are impossible to distinguish and hence to debug. If we return rit-SSE64 and rit-NDPfloat then they can be separated. The source code becomes impenetrable without separate names.

We should also acknowledge that parsers may return one or more rits on success, e.g. R:SNUM and R:DNUM. There's no point saying "don't do that"; such systems exist in the wild. They are a natural consequence of what Ruvim calls compound recognisers.

ruv

Standardizing token descriptors

I think, a set of the basic descriptors should be standardized. We need the descriptors at least for the following tokens: xt, nt, x, xd, f, c-addr u (a string), and also an implementation dependent token for the result of FIND (for back compatibility) — seven descriptors in total.

Should the values be standardized (as in the case of throw codes), or the names? The values are required for binary interoperability, which is not our case at the moment; they also make it possible to reduce the number of names. But in any case, it is better to have names instead of just magic numbers in source code.

I would suggest forming these names using the mnemonic prefix TD- (after "Token Descriptor"). E.g. TD-XT, TD-NT, TD-LIT, TD-2LIT, TD-FLIT, TD-SLIT, TD-WORD.

The names TD-LIT, TD-2LIT, TD-FLIT, TD-SLIT are after the corresponding standard words LITERAL, 2LITERAL, FLITERAL, SLITERAL.

Other variants for them (TD-NUM, TD-2NUM, TD-FLOAT, TD-STRING) do not map well onto the corresponding words for compiling the tokens.


An FP package, as well as any other, may provide its own token descriptor, if it is reasonable.

ruv

Why do we need standard token descriptors?

In many cases a user-defined recognizer returns either a token of one of these basic data types: xt, nt, x, xd, f, (c-addr u), or a tuple of such tokens. Also, most standard recognizers (i.e. those that can be standardized) return some of these tokens too. So, one argument is that the corresponding descriptors have a very high degree of reuse.

The second argument is that they are needed to analyze a result of the standard compound recognizers (if any).

The third argument is that in some cases they can help to create a new user-defined descriptor.

One approach to create descriptors

If a user-defined recognizer returns a token that is just a tuple of other known tokens, a descriptor for this token can be created from the tuple of the corresponding descriptors. I.e., there is no need in this case to specify interpreting action, compiling action, reproducing action, etc.

For example, in my comparison of Recognizer and Resolver APIs a token is represented as ( d-numerator u-denominator ) pair. And the descriptor TD-3LIT for this token can be defined as:

TD-2LIT TD-LIT 2  DESCRIPTOR-COMPOSITE{ TD-3LIT }

Or, if a user needs a distinct descriptor for an already known token:

TD-FLIT 1 DESCRIPTOR-COMPOSITE{ TD-FLIT-SPECIAL }

OTOH, at the moment I am not sure that such a method of creating a descriptor is worth standardizing.

I personally prefer an easier way that is based not on token descriptors but on token translators. Not every user-defined token descriptor can be created from other standard descriptors. But any user-defined token translator can always be created from other standard token translators, so we don't need to specify the particular actions at all.

ruv

Terms definitions

to create a token descriptor (informally): to create an implementation dependent token descriptor object producing its identifier or producing a named definition that returns this identifier. (see other terms at GitHub/ForthHub/fep-recognizer)

An approach to create any token descriptor

If we add one special predefined descriptor, we will be able to express any possible semantics for the token descriptors, without providing a reproducing or a postponing action.

This descriptor should describe a token that is a pair of xts: ( xt-interpret xt-compile ). Let's call it TD-DUAL for a while.

And let's consider an example: the string literals (see the parse-slit-end definition there). The old recognizer: rec-sliteral ( c-addr1 u1 -- c-addr2 u2 rectype-slit|rectype-slit-parsing | 0 )

The new one:

\ Create the token descriptor for a token ( c-addr u  xt-interp xt-compil )
td-slit td-dual 2 descriptor constant td-slit-parsing

\ A helper that returns the second part of the fully qualified token
: token-slit-parsing ( -- xt-interp xt-compil td-slit-parsing ) ['] parse-slit-end  [: parse-slit-end slit, ;]   td-slit-parsing ;

\ The recognizer for a string literal: a lexeme that starts with '"' (quote)
: recognize-sliteral ( c-addr1 u1 -- c-addr2 u2 td-slit | c-addr2 u2 xt-interp xt-compil td-slit-parsing | 0 )
  dup 0= if nip exit then
  over c@ '"' <> if 2drop 0 exit then
  1 /string dup 0= if token-slit-parsing exit then
  \ todo: fail if '"' is found in the middle of the string (c-addr1 u1)
  2dup + char- c@ '"' = if char- td-slit exit then
  token-slit-parsing
;

NB: we have to provide neither a reproducing action nor a postponing action. POSTPONE "abc" works as expected.

POSTPONE "x can work in the same way as POSTPONE S", i.e., append the compilation semantics for the corresponding lexeme. And the compilation semantics for "x is to parse characters up to ", prepend "x " to the resulting string, and compile this combined string. However, I am not sure it has a practical application. If supported, it should work in this way from a conceptual point of view, but supporting it is not obligatory.

Automation of POSTPONE action

StephenPelc wrote:

We cannot predict how recognisers will be used, so attempting to automate the POSTPONE actions is doomed to failure.

I know only one case where the POSTPONE action cannot be automated: a special behavior that Anton suggested when POSTPONE is applied to a local variable, namely that POSTPONE local-x should be equivalent to local-x POSTPONE LITERAL, that is local-x LIT,. Actually, this behavior violates the specification for POSTPONE, which should append the compilation semantics for local-x in this case. And the only reason for this violation is to allow seamless use of local variables inside the ]] ... [[ construct (in the simplest implementation).

Could somebody provide a correct POSTPONE action that cannot be automated?

alextangent

Yes; for example a JIT compiler that might want a set of actions where the intent is not to immediately interpret, compile or postpone but to provide input for a later compilation pass.

ruv

Just to close this question concerning a JIT compiler:

for example a JIT compiler that might want a set of actions where the intent is not to immediately interpret, compile or postpone but to provide input for a later compilation pass.

If the action for compilation (and reproduction) is defined in such a way that it provides input for a later compilation pass, then the automated action for POSTPONE will generate the input for a later compilation pass too, automatically. Since the action for POSTPONE is just expressed via other actions.

So this example of a JIT compiler doesn't show that POSTPONE action cannot be automated.

To be more concrete, please provide a particular code example where, in your view, the POSTPONE action cannot be automated.

ruv

Could somebody provide a correct POSTPONE action that cannot be automated?

A postpone action is automated via 1) a reproduce action, 2) a compile action, and 3) the system's compile, word.

In some edge cases, these components are not consistent with each other, namely a compile action may generate code that is not compatible with what the system's compile, generates. In such cases a correct postpone action cannot be automatically generated (see also my post Against a reproducer in a token descriptor).

So between a reproduce (partial postpone) action and a full postpone action the latter one should be chosen.


Personally, I prefer not the descriptor-based approach but the translator-based approach, which allows implementing a postponing mode. This mode works transparently and automatically everywhere. And it is a far more convenient means than postpone for user-defined literals.

UlrichHoffmann New Version: Recognizer RfD rephrase 2020

Superseded by the "minimalistic core API for recognizers" proposal.

Recognizer RfD rephrase 2020

Author: Ulrich Hoffmann
Contact: uho@xlerb.de
Version: 0.8 Date: 2020-02-24 Status: Published

Preamble

This text is a rephrasing of just section XY.2, XY.6, section XY.7 and parts of A.XY of the original recognizer RfD [1] by Matthias Trute that uses terminology and word names closer to that already present in Forth-94 and Forth-2012.

It is not intended to invalidate the subsequent RfDs B, C or D [2][3][4]. They reflect the ongoing discussion about Forth recognizers and should be considered valuable documentation of that discussion. This text, however, is intended to bring the recognizer proposal back to simple concepts and terms, making it easier to understand and use as well as simpler to implement.

This text does not add any new functionality to the original proposal. It merely introduces different terms for the structures already existing in the original proposal. The only difference in functionality is the substitution of the defining word RECOGNIZER: of the original proposal by the word RECOGNIZER (note the missing : ) that - similar to the Forth-94 word WORDLIST - creates a recognizer information token and leaves it on the data stack.

Yes - this text has the potential of starting a bikeshedding discussion but as the recognizer concepts seem to be stable over the last couple of years it is about time to agree on appropriate names and notions.

The following table summarizes the different terms and names:

| Term in original proposal | Term used here | Comment |
|---|---|---|
| recognizer stack | recognizer-order | similar to search-order |
| information token (rit) | recognizer information token (rit) | explicit and consistent |
| DO-RECOGNIZER | RECOGNIZE | avoid hyphen in name |
| RECOGNIZER: | RECOGNIZER | similar to WORDLIST, no defining word |
| R:FAIL | UNRECOGNIZED | no : in name, better English |
| REC:xxx | recognize-xxx | no : in name, better English |
| R:xxx | xxx-recognized | no : in name, better English |

Items to discuss

  1. Programs that use the word RECOGNIZE (e.g. user-defined text interpreters) most likely need to use the interpret/compile/postpone xts of the returned recognizer information token. For these programs to be portable among standard systems appropriate access words would need to be standardized. [3] and [4] propose such words. Without these access words standardizing the word RECOGNIZE is doubtful. Only standardizing the modified (internal) text interpreter behavior would be sufficient then.

  2. The word RECOGNIZER (and the corresponding defining word RECOGNIZER: of [1]) creates the opaque structure recognizer information token. As an alternative, recognizer information tokens could be defined - similar to addresses of counted strings (c-addr) - as special addresses, and the structure of memory at that address could be exposed. Recognizer information tokens could then be created by already existing standard words such as CREATE ALLOT ALLOCATE and would have a known layout, e.g. three xts in sequence: { INTERPRET-XT | COMPILE-XT | POSTPONE-XT }. The access words of 1. would not need to be standardized, as each standard program could access the xts using already existing standard words for memory access.

  3. The word RECOGNIZER (and the corresponding defining word RECOGNIZER: of [1]), despite its name, does not create a recognizer (i.e. a parsing word plus possibly several recognizer information tokens) but a single recognizer information token (a triple of interpret/compile/postpone xts characterized by a single-cell value). Another name might reflect this functionality better.

  4. Changes in the standard text interpreter (i.e. that it invokes the word RECOGNIZE internally) have implications for many other words apart from MARKER (e.g. ' ['] EVALUATE INCLUDE-FILE INCLUDED ...). Changes in their behaviour should be mentioned in the proposal. [2] proposes explicit changes for ' ['] MARKER, while [3] and [4] have a paragraph describing the implications generally and do not explicitly propose changes to individual words such as MARKER.

  5. Recognizer information tokens (triple of interpret/compile/postpone xts characterized by a single-cell value) could be named more appropriately. [4] proposes a different name data type id that does not seem to be appropriate. Its general notion seems to mislead into the direction of Forth having a data type system.
    From a classical computer science view, recognizers act in the lexical analysis (scanner) phase of a compiler: they operate on sequences of characters, detect appropriate lexemes (character subsequences of the input stream) and convert them to tokens. Several lexemes might map to the same token (e.g. different sequences of digits map to the token NUM) along with so-called attributes (e.g. the value of the number). For this reason tokens are sometimes also called token classes, token types, or the kind of the token. These might be good alternative names instead of recognizer information token or data type id. Forth-94 and Forth-2012 use the term ID (as in wordlist-id or file-id) for characterizing single-cell values, so following the xxx-id pattern would be consistent with existing standard terms (maybe recognizer-token-id?).
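The exposed-layout alternative of item 2 can be sketched as follows. This is only an illustration under the assumption of the three-cell layout { INTERPRET-XT | COMPILE-XT | POSTPONE-XT }; the names rit, and rit-interpret@ / rit-compile@ / rit-postpone@ as well as the cell-xxx example words are hypothetical and are not the access words proposed in [3] or [4].

```forth
\ Sketch only: all names below are hypothetical.

\ Lay down three xts in sequence: interpret, compile, postpone.
: rit, ( xt-interpret xt-compile xt-postpone -- )
    ROT ,  SWAP ,  , ;

\ Accessors built purely from existing memory-access words.
: rit-interpret@ ( rit -- xt )  @ ;
: rit-compile@   ( rit -- xt )  CELL+ @ ;
: rit-postpone@  ( rit -- xt )  2 CELLS + @ ;

\ Example actions for single-cell literals.
: cell-interpret ( x -- x ) ;
: cell-compile   ( x -- )  POSTPONE LITERAL ;
: cell-postpone  ( x -- )  POSTPONE LITERAL  ['] cell-compile COMPILE, ;

\ A token created with CREATE instead of RECOGNIZER.
CREATE cell-token  ' cell-interpret ' cell-compile ' cell-postpone rit,
```

A user-defined text interpreter could then select an action with, for instance, cell-token rit-compile@ EXECUTE, without any additional standardized access words.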

References

[1] Forth Recognizer -- Request For Discussion, Version 1, Matthias Trute, 2014-10-03, access at http://amforth.sourceforge.net/pr/Recognizer-rfc.pdf

[2] Forth Recognizer -- Request For Discussion, Version 2, Matthias Trute, 2015-09-20, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-B.pdf

[3] Forth Recognizer -- Request For Discussion, Version 3, Matthias Trute, 2016-09-04, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-C.pdf

[4] Forth Recognizer -- Request For Discussion, Version 4, Matthias Trute, 2018-08-02, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-D.pdf


Proposal

....

XY.2 Additional terms and notations

Recognizer Information Token: An implementation-dependent single-cell value that identifies the data type and a method table to perform the data processing of the interpreter. A naming convention suggests that the names end with -recognized. Recognizer Information Tokens are abbreviated rit in stack comments.

Recognizer: A combination of a text parsing word and the recognizer information tokens it returns, together with the parsed data, if successful. The text parsing word is assumed to run in cooperation with SOURCE and >IN. A naming convention suggests that the names start with recognize-.

...

XY.6 Glossary

XY.6.1 Recognizer words

RECOGNIZE ( addr len -- i*x rit | UNRECOGNIZED ) RECOGNIZER

Apply the recognizers in the recognizer-order to the string at "addr/len" one after the other. Terminate the iteration if either a recognizer returns a recognizer information token rit that is different from UNRECOGNIZED or the recognizer-order is exhausted. If the recognizer-order is exhausted, return UNRECOGNIZED; otherwise return rit.

"i*x" is the result of the parsing word. It may reside in locations other than the data stack. In this case the stack diagram should be read accordingly.

It is an ambiguous condition if the recognizer-order is empty.


GET-RECOGNIZERS ( -- rec-n .. rec-1 n ) RECOGNIZER

Return the execution tokens rec-1 .. rec-n of the parsing words in the recognizer-order. rec-1 identifies the recognizer that is called first and rec-n the execution token of the word that is called last.

The recognizer-order is unaffected.


MARKER ( "<spaces>name" -- ) RECOGNIZER

Extend MARKER to include the current recognizer-order in the state preservation.


UNRECOGNIZED ( -- UNRECOGNIZED ) RECOGNIZER

A constant cell sized recognizer information token with two uses: first it is used to deliver the information that a specific recognizer could not deal with the string passed to it. Second it is a predefined recognizer information token whose elements are used when no recognizer from the recognizer-order could handle the passed string. These methods provide the system error actions.

The actual numeric value is system dependent and has no predictable value.


RECOGNIZER ( XT-INTERPRET XT-COMPILE XT-POSTPONE -- rit ) RECOGNIZER

Create a recognizer information token rit with the three execution tokens XT-INTERPRET XT-COMPILE XT-POSTPONE. The implementation is system dependent.

The words for XT-INTERPRET, XT-COMPILE and XT-POSTPONE are called with the parsed data that the associated parsing word of the recognizer returned. The information token itself is consumed by the interpreter.


SET-RECOGNIZERS ( rec-n .. rec-1 n -- ) RECOGNIZER

Set the recognizer-order to the recognizers identified by the execution tokens of their parsing words rec-n .. rec-1. rec-1 will be the parsing word of the recognizer that is called first, rec-n will be the last one.

It is an ambiguous condition, if n is not a positive number.

XY.7 Reference Implementation

\ create a simple 3 element structure
\ rit           : XT-INTERPRET
\ rit CELL+     : XT-COMPILE
\ rit 2 CELLS + : XT-POSTPONE
: RECOGNIZER ( XT-INTERPRET XT-COMPILE XT-POSTPONE -- rit )
    HERE >R SWAP ROT , , , R> ;
    
\ system failure recognizer
: notfound ( i*x -- )  -13 THROW ;

' notfound  ' notfound  ' notfound RECOGNIZER CONSTANT UNRECOGNIZED

\ contains the recognizer-order
\ first cell is the current number of recognizers.
10 CELLS BUFFER: recognizer-order
0 recognizer-order !

: SET-RECOGNIZERS ( rec-n .. rec-1 n -- )
    DUP recognizer-order !
    BEGIN
      DUP
    WHILE
      DUP CELLS recognizer-order +
      ROT SWAP ! 1-
    REPEAT DROP 
;

: GET-RECOGNIZERS ( -- rec-n .. rec-1 n )
    recognizer-order @ recognizer-order
    BEGIN
      CELL+ OVER
    WHILE
      DUP @ ROT 1- ROT
    REPEAT 2DROP
    recognizer-order @
;

: RECOGNIZE ( addr len -- i*x rit | UNRECOGNIZED )
    recognizer-order @
    BEGIN
      DUP
    WHILE
      DUP CELLS recognizer-order + @
      2OVER 2>R SWAP 1- >R
      EXECUTE DUP UNRECOGNIZED <> IF R> DROP 2R> 2DROP EXIT THEN DROP
      R> 2R> ROT
    REPEAT
    DROP 2DROP
    UNRECOGNIZED
;

POSTPONE

POSTPONE is outside the Forth interpreter:

: POSTPONE ( "<spaces>name" -- )
   BL WORD COUNT
   RECOGNIZE
   2 CELLS + @ ( post ) \ get the XT-POSTPONE from recognizer
   EXECUTE
; IMMEDIATE

...

A.XY Informal Annex

A.XY.1 Forth Text Interpreter

The Forth text interpreter turns into a generic tool that is capable of dealing with any data type. It maintains STATE and calls the data processing methods according to it.

INTERPRETER
: PARSE-NAME ( -- addr u ) BL WORD COUNT ;

: INTERPRET ( i*x -- j*y )
    BEGIN
      PARSE-NAME DUP 0= IF 2DROP EXIT THEN \ no more words?
      RECOGNIZE
      STATE @ IF  CELL+ @  ( comp ) ELSE @ ( interp ) THEN \ get the right XT
      EXECUTE \ do the action
      ?STACK \ simple housekeeping
    AGAIN 
;

A.XY.2 Example Recognizers

Word recognizer
\ find-name is close to FIND. amforth specific.
256 BUFFER: find-name-buf

: place ( c-addr1 u c-addr2 )
   2DUP C! CHAR+ SWAP MOVE ;

: find-name ( addr len -- xt +/-1 | 0 )
   find-name-buf place
   find-name-buf
   FIND DUP 0= IF NIP THEN ;

: immediate? ( flags -- true|false ) 0> ;
    
\ Define word recognizer

\ INTERPRET
:NONAME ( i*x XT flags -- j*y )
  DROP EXECUTE ;

\ COMPILE
:NONAME ( XT flags -- )
  immediate?
  IF EXECUTE ELSE COMPILE, THEN ; \ immediate words run during compilation

\ POSTPONE
:NONAME ( XT flags -- )
  immediate?
  IF COMPILE, ELSE POSTPONE LITERAL POSTPONE COMPILE, THEN ;

RECOGNIZER CONSTANT word-recognized

\ parsing word for word recognizer
: recognize-word ( addr len -- XT flags rit | UNRECOGNIZED )
   find-name ( addr len -- XT flags | 0 )
   ?DUP IF word-recognized ELSE UNRECOGNIZED THEN ;

\ prepend the word recognizer to the recognizer-order
GET-RECOGNIZERS ' recognize-word SWAP 1+ SET-RECOGNIZERS
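A second recognizer, for unsigned single-cell numbers in the current BASE, might look like the following. This is a minimal sketch and not part of the original RfD: the names compile-unumber, unumber-recognized and recognize-unumber are hypothetical, overflow and signs are ignored, and RECOGNIZER and UNRECOGNIZED are assumed to behave as in XY.7 (fallback copies of those definitions are compiled only if the words are missing).

```forth
\ Sketch only: assumes RECOGNIZER and UNRECOGNIZED as in XY.7.
[UNDEFINED] RECOGNIZER [IF]
: RECOGNIZER ( xt-interpret xt-compile xt-postpone -- rit )
    HERE >R SWAP ROT , , , R> ;
[THEN]
[UNDEFINED] UNRECOGNIZED [IF]
: notfound ( i*x -- )  -13 THROW ;
' notfound ' notfound ' notfound RECOGNIZER CONSTANT UNRECOGNIZED
[THEN]

\ helper: compile u as a literal into the current definition
: compile-unumber ( u -- )  POSTPONE LITERAL ;

\ INTERPRET: the number simply stays on the stack
:NONAME ( u -- u ) ;

\ COMPILE
' compile-unumber

\ POSTPONE: compile code that later compiles the literal
:NONAME ( u -- )  POSTPONE LITERAL  ['] compile-unumber COMPILE, ;

RECOGNIZER CONSTANT unumber-recognized

\ parsing word: succeeds only if every character is a digit in BASE
: recognize-unumber ( addr len -- u rit | UNRECOGNIZED )
    DUP 0= IF 2DROP UNRECOGNIZED EXIT THEN
    0 0 2SWAP >NUMBER NIP        \ ( ud len' ) len' = unconverted chars
    IF 2DROP UNRECOGNIZED ELSE DROP unumber-recognized THEN ;
```

To have it tried after the word recognizer, it could be appended to the recognizer-order with ' recognize-unumber GET-RECOGNIZERS 1+ SET-RECOGNIZERS, since the new execution token then sits below rec-n and is therefore called last.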

end of document
