Digest #119 2020-09-07

Contributions

[160] 2020-09-06 09:40:07 BerndPaysan wrote:

proposal - minimalistic core API for recognizers

Author:

Bernd Paysan

Change Log:

2020-09-06 initial version

Problem:

The current recognizer proposal has received a number of criticisms. One is that its API is too big. This proposal therefore tries to define a very minimalistic API for a core recognizer, and allows fancier features to be implemented as extensions. The problem it tries to solve is the same as in the original recognizer proposal; it is therefore not a full proposal, but sketches some changes to the original.

Solution:

Define the essentials of the recognizer in a RECOGNIZER word set, and allow building upon that. Common extensions go to the RECOGNIZER EXT wordset.

Important changes to the original proposal:

Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
Make the recognizer sequence executable with the same effect as a recognizer
Make the system's forth-recognizer a deferred word to allow plugging in new recognizer sequences

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

The core principle is still that the recognizer is not aware of state, and the returned data type id is. If you have for some reason legacy code that looks like

: rec-nt ( addr u -- rectype )
  here place  here find dup IF
      0< state @ and  IF  compile,  ELSE  execute  THEN  ['] drop
  ELSE  drop ['] rectype-null  THEN ;

then be told that this is not the right way, even though it looks like it is working.

Typical use

TBD

Proposal:

XY. The optional Recognizer Wordset

A recognizer takes a string and returns a rectype+additional data on the stack (no additional data for RECTYPE-NULL):

REC-SOMETYPE ( addr len -- i*x RECTYPE-SOMETYPE | RECTYPE-NULL )

XY.3 Additional usage requirements

XY.3.1 Data type id

rectype: subtype of xt, and executes with the following stack effect:

RECTYPE-SOMETYPE ( i*x state -- j*x )

state is:

0 for interpretation
-1 for compilation
-2 for POSTPONE

i*x is the additional information provided by the recognizer.

XY.6 Glossary

XY.6.1 Recognizer Words

FORTH-RECOGNIZER ( addr len -- i*x rectype | RECTYPE-NULL ) RECOGNIZER

This is a deferred word. It takes a string and tries to recognize it, returning the recognized recognizer type and additional information if successful, or RECTYPE-NULL if not.

RECTYPE-NULL ( state -- ) RECOGNIZER

Performs -13 THROW if the exception wordset is available.

Reference implementation:

This is a minimalistic core implementation for a recognizer-enabled system, that handles only words and single numbers without base prefix:

Defer forth-recognizer ( addr u -- i*x rectype / rectype-null )
: interpret ( i*x -- j*x )
  BEGIN
      ?stack parse-name dup  WHILE
      forth-recognizer state @ swap execute
  REPEAT  2drop ;

: lit,  ( n -- )  postpone literal ;
: rectype-null ( state -- ) -13 throw ;
: rectype-nt ( nt state -- )
  case
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;
: rectype-num ( n state -- )
  case
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

: rec-nt ( addr u -- nt rectype-nt / rectype-null )
  forth-wordlist find-name-in dup IF  ['] rectype-nt  ELSE  drop ['] rectype-null  THEN ;
: rec-num ( addr u -- n rectype-num / rectype-null )
  0. 2swap >number 0= IF  2drop ['] rectype-num  ELSE  2drop drop ['] rectype-null  THEN ;

: minimal-recognizer ( addr u -- nt rectype-nt / n rectype-num / rectype-null )
  2>r
  2r@ rec-nt dup ['] rectype-null <> IF  2r> 2drop EXIT  THEN  drop
  2r@ rec-num dup ['] rectype-null <> IF  2r> 2drop EXIT  THEN  drop
  2r> 2drop ['] rectype-null ;

' minimal-recognizer is forth-recognizer
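
To illustrate how a further recognizer could follow the same pattern, here is a minimal sketch of a recognizer for $-prefixed hexadecimal numbers. REC-HEX is a hypothetical name and not part of the proposal; LIT, RECTYPE-NULL and RECTYPE-NUM are repeated from the reference implementation only so that the fragment loads on its own.

```forth
: lit,  ( n -- )  postpone literal ;
: rectype-null ( state -- ) -13 throw ;
: rectype-num ( n state -- )
  case
      -1 of   lit,  endof
      -2 of   lit, postpone lit,  endof
  endcase ;

\ REC-HEX is hypothetical: it accepts strings like $FF and, like the
\ recognizers above, never looks at STATE.
: rec-hex ( addr u -- n rectype-num / rectype-null )
  dup 2 < IF  2drop ['] rectype-null EXIT  THEN             \ need "$" plus a digit
  over c@ [char] $ <> IF  2drop ['] rectype-null EXIT  THEN
  1 /string                                                 \ skip the "$"
  base @ >r hex  0. 2swap >number  r> base !                \ convert in hex, restore BASE
  0= IF  2drop ['] rectype-num  ELSE  2drop drop ['] rectype-null  THEN ;

\ the interpreter's calling pattern, with state 0 (interpretation)
s" $FF" rec-hex 0 swap execute .   \ prints 255
```

A sequence such as MINIMAL-RECOGNIZER could try a recognizer like this as one more step before returning RECTYPE-NULL.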

Testing

Replies

[r496] 2020-09-06 06:37:35 AntonErtl replies:

proposal - Recognizer

In Forth we have a number of tokens already: execution token, name token. Therefore, the terminology of a book about compilers would be misleading in Forth (cf. the meaning of "word" in a computer architecture book).

Moreover, different recognizers often produce the same rectype. E.g., in Gforth, rectype-num is returned by at least rec-num, rec-tick, and rec-body. So the role of a rectype is different from that of a token class in Grune's book.

One advantage that the rectype-* names have over tokenclass-* is that the association with recognizers is more obvious.

Finally, the committee already had a discussion on the names, and agreed on the rectype-* names. As far as I am concerned, that is good enough. And in particular, I think that the incremental benefits of better names (even if one could be agreed upon, which is doubtful) are not worth the costs of trying to find it, finding consensus on it, changing the existing code and documentation, and making existing discussions about this concept harder to understand.

[r497] 2020-09-06 12:43:23 AndrewHaley replies:

proposal - An alternative to the RECOGNIZER proposal

"recognizer" term is used in another meaning and its definition is not provided.

By my terminology, a Forth definition ( i*x c-addr u mode -- j*x true | i*x false ) tries to translate the lexeme ( c-addr u ) according to the mode.

I think it's clear enough what I'm proposing. Are you objecting to the use of the common word "recognize"?

I like the idea of translators.

Ambiguous conditions

This approach introduces a very big and ugly ambiguous condition: if a program uses a word with the "[recognize]" name for its own purposes, it may crash.

It's not ambiguous at all: once this feature is enabled, "[recognize]" may not be used as a name for any other purpose. Of course the actual string "[recognize]" isn't fixed; it could be anything. It could even be something unprintable and untypeable, entered into the dictionary by some means.

Forth didn't have reserved names. But now some particular names cannot be used by a program.

Good point.

A similar approach is used in SP-Forth. There is a magic name "NOTFOUND".

Low reusing factor

What a program should use to recognize a number? How to create a user-defined Forth text interpreter? How to reuse the system's Forth text interpreter?

I believe numbers are recognized by >NUMBER.

This section doesn't describe how to create a user-defined Forth text interpreter. That's to be decided.

Reusing the system's text interpreter is performed by EVALUATE.

This approach doesn't make these things simple.

In my opinion, proposals should do one thing, and do it well, in the simplest way possible. The object of this proposal is a simple, flexible way to enable user-defined literals and e.g. dot parsers.

Too limited area of use cases.

E.g. a recognizer for 'X form cannot be defined to recognize X in any currently available form (e.g. wordlist::word form).

Can you explain more? Are you talking about nested parsers?

[r498] 2020-09-06 13:08:07 AndrewHaley replies:

proposal - An alternative to the RECOGNIZER proposal

"recognizer" term is used in another meaning and its definition is not provided.

By my terminology, a Forth definition ( i*x c-addr u mode -- j*x true | i*x false ) tries to translate the lexeme ( c-addr u ) according to the mode.

I think it's clear enough what I'm proposing. Are you objecting to the use of the common word "recognize"?

I like the idea of translators.

Ambiguous conditions

This approach introduces a very big and ugly ambiguous condition: if a program uses a word with the "[recognize]" name for its own purposes, it may crash.

It's not ambiguous at all: once this feature is enabled, if you define "[recognize]" it's going to be called by the interpreter. Of course the actual string "[recognize]" isn't fixed; it could be anything. It could even be something unprintable, entered into the dictionary by some means. It could even be a bit like an IMMEDIATE bit rather than a special name. The point is to put recognizers into the dictionary.

Forth didn't have reserved names. But now some particular names cannot be used by a program.

Good point.

A similar approach is used in SP-Forth. There is a magic name "NOTFOUND".

Low reusing factor

What a program should use to recognize a number? How to create a user-defined Forth text interpreter? How to reuse the system's Forth text interpreter?

I believe numbers are recognized by >NUMBER.

This section doesn't describe how to create a user-defined Forth text interpreter. That's to be decided.

Reusing the system's text interpreter is performed by EVALUATE.

This approach doesn't make these things simple.

In my opinion, proposals should do one thing, and do it well, in the simplest way possible. The object of this proposal is a simple, flexible way to enable user-defined literals and e.g. dot parsers.

Too limited area of use cases.

E.g. a recognizer for 'X form cannot be defined to recognize X in any currently available form (e.g. wordlist::word form).

This seems like a request for nested parsers and that brings with it the need for other techniques, e.g. recursive descent, precedence grammars, and so on. What does 'a::b.c mean, anyway? Is it ('a)::(b.c) or '(a::b).c or something else? There's no way to know without specifying an operator-precedence grammar, and it's the job of the programmer to do that. This proposal is a very simple way for a Forth user to add to the text interpreter; what they add is up to them.

There's nothing to prevent any user from writing a word to recognize e.g. a context-free grammar and then calling it from [recognize]. But again, if you want to write a parser for Language X in Forth you can do that then pass the string to your parser. This proposal isn't supposed to do that.

[r499] 2020-09-06 16:10:33 JennyBrien replies:

proposal - An alternative to the RECOGNIZER proposal

I like the idea of linking recognizers to wordlists. Generally speaking, if you need to recognize, say, FP numbers, you also need to recognize the words to handle them. Often you are building and using wordlists on the fly, but for library code Filename=Word Set=Wordlist seems useful.

However: by analogy with:

b) Search the dictionary name space (see 3.4.2). If a definition name matching the string is found:

if interpreting, perform the interpretation semantics of the definition (see 3.4.3.2), and continue at a).

if compiling, perform the compilation semantics of the definition (see 3.4.3.3), and continue at a).

d) might read:

d) Search the dictionary name space (see 3.4. x..x.) for associated recognizers. If successful:

if interpreting, perform the interpretation semantics of the returned rec-type and continue at a).

if compiling, perform the compilation semantics of the returned rec-type and continue at a).

3.x.x. Finding Associated Recognizers

Recognizers are associated with the CURRENT wordlist by:

RECOGNIZED ( xt -- ) 
Note: This can be done by having RECOGNIZED add an entry to the current wordlist that performs xt. If so, its name should contain a space, or otherwise be unfindable by SEARCH-WORDLIST.

They are executed in reverse order of definition for each wordlist in the search order until one returns a rec-type other than REC-FAIL or there are no more to be found.

 FIND-RECOGNIZER  ( a n -- rec-type true | rec-fail false )

[r500] 2020-09-06 17:42:23 JennyBrien replies:

proposal - minimalistic core API for recognizers

This replaces one poor man's method dispatch with another poor man's method dispatch, which is maybe less daunting and more flexible.

I don't think so. It doesn't make much difference in application, because you (almost always?) need to consume the rec-type immediately to use whatever else might be on the stack(s). If you already know what you've got but, for example, can't remember the words to POSTPONE it, you could with an active RECTYPE do something like:

    -2 RECTYPE-X

But mostly you'll have the RECTYPE sitting passively on the stack as a return for a recognizer, and I don't see a great deal of difference between:

    : postponed  -2 swap execute ;

and

    : postponed  @ execute ;

Passive rectypes are easier to use (no need to remember when to tick them) and easier to code (no need to check for a bogus mode on the stack).

Compare:

: rectype-nt ( nt state -- )
  case
      0  of  name>interpret execute  endof
      -1 of  name>compile execute  endof
      -2 of  name>compile swap lit, compile,  endof
      nip \ do nothing if state is unknown; possible error handling goes here
  endcase ;

with:

 : rectype: create , , , ;
 :noname name>interpret execute ;
 :noname name>compile execute ;
 :noname name>compile swap lit, compile, ;  rectype: rectype-nt
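
A minimal sketch of how such passive rectypes might be used, assuming the table layout that RECTYPE: above produces (the postpone action ends up in the first cell). INTERPRETED and COMPILED are hypothetical companions to POSTPONED, and a number rectype stands in for RECTYPE-NT to keep the fragment self-contained.

```forth
: lit,  ( n -- )  postpone literal ;
: rectype: ( xt-int xt-comp xt-post "name" -- )  create , , , ;

\ accessors (names assumed): cell 0 = xt-post, cell 1 = xt-comp, cell 2 = xt-int
: postponed   ( i*x rectype -- j*x )  @ execute ;
: compiled    ( i*x rectype -- j*x )  cell+ @ execute ;
: interpreted ( i*x rectype -- j*x )  2 cells + @ execute ;

:noname ( n -- n ) ;                     \ interpret: leave n on the stack
:noname ( n -- )  lit, ;                 \ compile: compile n as a literal
:noname ( n -- )  lit, postpone lit, ;   \ postpone: compile code that compiles n
rectype: rectype-num

42 rectype-num interpreted .             \ prints 42
: answer ( -- n )  [ 42 rectype-num compiled ] ;
```

Here the rectype is plain data, so no mode value is passed and no default case is needed.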

[r501] 2020-09-06 19:02:34 AntonErtl replies:

proposal - Nestable Recognizer Sequences

We have defining words and id-creating words in the standard already. Here I proposed a defining word because it's useful to have a name for the sequence, for building the next sequence. Also, named entities are useful for introspection features like ORDER. The namelessness of WORDLIST is useful when you want to use it as a building stone for, e.g., VOCABULARY. Gforth has NONAME, so in such a system you can create nameless recognizer sequences.

The interface of recognizers means that you have to perform some checking after every recognizer has been called (if successful, then return from the sequence). That's repetitive to write as a colon definition.

The result of GET-REC-SEQUENCE for a non-rec-sequence (and not deferred word) can be either way, and you can do the other from the one you have. If you want to enumerate all the base recognizers (e.g., for printing), I find the proposed version slightly easier. It's also less misleading when the passed xt is not actually a recognizer.

[r502] 2020-09-06 20:25:08 AntonErtl replies:

proposal - Nestable Recognizer Sequences

Recognizer sequence building with a binary constructor

The proposal above allows you to build rec-sequences with n recognizers in order to support implementing interfaces like GET-RECOGNIZER SET-RECOGNIZER based on it. An alternative is to allow exactly two recognizers in a sequence, and build bigger sequences as a (possibly degenerate) tree of such sequence nodes. In many use cases you combine only two recognizers into a rec-sequence at one time, anyway (see the typical use above).

Whether the binary constructor should produce a named or unnamed sequence is not clear to me. I'll use named sequences in the rest of this posting. Whether to pass the first recognizer on top or bottom is also unclear (top in the following). So we define

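: two-recognizers ( xt1 xt2 "name" -- )
  create , ,
does> ( c-addr u body -- ... rectype )
  rot rot 2dup 2>r rot     \ keep a copy of the string for the second recognizer
  dup >r @ execute dup rectype-null <> if
    r> drop 2r> 2drop exit then
  drop r> 2r> rot cell+ @ execute ;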

Typical Use

' rec-num ' rec-nt two-recognizers rec-forth-cm ( c-addr u -- ... rectype )
' rec-float ' rec-forth-cm two-recognizers rec-forth ( c-addr u -- ... rectype )
' rec-forth ' rec-dot two-recognizers rec-.forth ( c-addr u -- ... rectype )
' rec-user forth-recognizer two-recognizers rec-forthuser
forth-recognizer ( old )
' rec-forthuser to forth-recognizer
\ some code that uses REC-USER:
...
\ now restore the old recognizer sequence
( old ) to forth-recognizer

Discussion

TWO-RECOGNIZERS (as well as the corresponding getter and setter words) is much shorter to implement than REC-SEQUENCE. The downside is that it cannot be used to implement GET-RECOGNIZER and SET-RECOGNIZER.

[r503] 2020-09-06 20:35:46 BerndPaysan replies:

proposal - minimalistic core API for recognizers

One possible thing is to have an automatic postpone for literals.

: rectype-lit: ( compile-xt "name" -- )
  create ,
  does> @ swap
  case
      0  of  drop  endof
      -1 of  execute  endof
      -2 of  dup >r execute r> compile,  endof
  endcase ;

' lit, rectype-lit: rectype-num
' 2lit, rectype-lit: rectype-dnum
' flit, rectype-lit: rectype-float
' slit, rectype-lit: rectype-string

This works with this method, but not with the previous way.
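
A small sketch of what the automatic postpone amounts to, calling the rectype by hand with state -2. FIVE, and FIVE are hypothetical names used only for the demonstration; RECTYPE-LIT: and LIT, are repeated so that the fragment loads on its own.

```forth
: lit,  ( n -- )  postpone literal ;
: rectype-lit: ( compile-xt "name" -- )
  create ,
  does> @ swap
  case
      0  of  drop  endof
      -1 of  execute  endof
      -2 of  dup >r execute r> compile,  endof
  endcase ;
' lit, rectype-lit: rectype-num

\ state -2: compiles "5 lit," into FIVE, (the effect of postponing the literal 5)
: five,  ( -- )  [ 5 -2 rectype-num ] ;
\ running FIVE, while FIVE is being defined compiles the literal 5 into FIVE
: five  ( -- n )  [ five, ] ;
five .   \ prints 5
```

The postpone action is derived from the compile action, so each new literal type needs only its compile-xt.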

[r504] 2020-09-06 20:38:26 BerndPaysan replies:

proposal - minimalistic core API for recognizers

Furthermore, obviously anyone sane who doesn't want to be 100% minimal would instantly define

: rectype: ( xt-int xt-comp xt-post "name" -- )
  create , , , does> swap 2 + cells + @ execute ;

and then define generic rectypes just like in Matthias Trute's version with rectype:
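
A minimal sketch of such a generic definition in use, under the assumption that a number rectype is built from three anonymous actions; ANSWER is a hypothetical test word. The index arithmetic maps state 0, -1 and -2 to cells 2, 1 and 0, which matches the order in which the three xts are stored.

```forth
: lit,  ( n -- )  postpone literal ;
: rectype: ( xt-int xt-comp xt-post "name" -- )
  create , , , does> swap 2 + cells + @ execute ;

:noname ( n -- n ) ;                     \ interpret: leave n on the stack
' lit,                                   \ compile: compile n as a literal
:noname ( n -- )  lit, postpone lit, ;   \ postpone: compile code that compiles n
rectype: rectype-num

42 0 rectype-num .                             \ state 0: prints 42
: answer ( -- n )  [ 42 -1 rectype-num ] ;     \ state -1: 42 compiled as a literal
answer .                                       \ prints 42
```

Unlike the CASE-based rectypes, this version does no checking: a state outside 0, -1, -2 indexes outside the table.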