Digest #118 2020-09-06

Contributions

[159] 2020-09-05 15:09:39 AndrewHaley wrote:

proposal - An alternative to the RECOGNIZER proposal

As I've said more than once, I think the proposal is too complex and does too much. But I've been challenged to up with an alternative I think is better.

I'd like it to:

- Allow user-defined literals, dot parsers, etc.

- Make small changes to the standard.

- Allow simple implementations.

- Have a small API surface.

- Be easy to use.

- Avoid adding any ambiguous conditions.

- Work well with the rest of the system and is as non-intrusive as possible.

With regard to that last one, I would like recognizers to be associated with wordlists, so that when a recognizer is defined it is in the current wordlist, and when that wordlist is visible so is the recognizer. Recognizer ordering therefore should follow the search order and not use a separate ordering and visibility mechanism. When a recognizer is forgotten, it should simply disappear. This is ideal for libraries that define recognizers and live in their own wordlist: to use a library add its wordlist to the search order, and the recognizers come too.

A rough draft of my proposed change is at

https://sourceforge.net/p/concurrentforth/code/ci/default/tree/recognizers/forth-interp.txt

The new part is section 3.X.X.

An example of such a recognizer for ` quoted characters is

: rectype-num ( n mode -- n | )   2 = if  postpone literal  then ;
    
: [recognize] ( a n mode - x...x t ) 
    -rot 3 = if
        dup c@ [char] ` = if
            dup 2 + c@ [char] ` = if
                1+ c@  swap rectype-num  true  exit
            then then
    then  drop 0 ;

For simplicity I haven't included POSTPONE actions or recognizer queries in this proposal, but they are trivial to add if we ever agree that they should be.

[And yes, the numbered modes should really be named constants, but that makes no difference to the idea.]

An implementation of this proposal (for SwiftForth) follows. I suspect it'd be similar in most Forths.

\ Execute all of the recognizers defined in wordlist WID until one
\ succeeds.
: (execute-recognizers) ( a n mode wid -- x...x xt | a n mode 0 )
    s" [recognize]" rot hashed {: a n mode link :}
    link
    begin @rel
        dup @ while
            dup l>name count s" [recognize]" compare(cs) 0= if
                to link
                a n mode link link> execute
                if ( success?)  link link> exit then
                link
            then
    repeat drop
    a n mode 0 ;

\ Execute all of the recognizers in the search order until one
\ succeeds.
: recognizers ( a n mode -- x...x xt | a n 0 )
    context #order @ cells over + swap do
        i @ (execute-recognizers)  
        ?dup if  unloop exit  then
    cell +loop
    drop 0 ;

[ I also had to patch the SwiftForth interpreter to call RECOGNIZERS, but I haven't included that here. ]

Replies

[r489] 2020-09-04 11:53:28 ruv replies:

proposal - Nestable Recognizer Sequences

get-rec-sequence ( xt -- xt1 .. xtn n )

If xt refers to a recognizer sequence, return the contained recognizers. If xt refers to a deferred word, perform DEFER@ followed by GET-REC-SEQUENCE (i.e., GET-REC-SEQUENCE works through deferred words).

IF xt refers to neither, return 0.

If recognizer sequences are immutable, a recognizer that is not a sequence can be viewed as a sequence with a single element. I.e., get-rec-sequence can have effect ( xt -- xt 1 ) for such recognizer. Can it be useful?

[r490] 2020-09-04 17:52:36 UlrichHoffmann replies:

proposal - Recognizer

The output of the scanner is often called "token class":

From Wikipedia (Lexical Analysis)

Lexers and parsers are most often used for compilers, but can be used for other computer language tools, such as prettyprinters or linters. Lexing can be divided into two stages: the scanning, which segments the input string into syntactic units called lexemes and categorizes these into token classes; and the evaluating, which converts lexemes into processed values.

Also see "Modern Compiler Design" Second Edition, Dick Grune • Kees van Reeuwijk • Henri E. Bal • Ceriel J.H. Jacobs • Koen Langendoen they use the same term.

So it might be reasonable to call RECTYPE: just TOKEN-CLASS.

[r491] 2020-09-04 20:21:59 UlrichHoffmann replies:

proposal - Recognizer

or maybe TOKENCLASS ?

[r492] 2020-09-05 15:23:52 AndrewHaley replies:

proposal - An alternative to the RECOGNIZER proposal

As I've said more than once, I think the proposal is too complex and does too much. But I've been challenged to up with an alternative I think is better.

I'd like it to:

- Allow user-defined literals, dot parsers, etc.

- Make small changes to the standard.

- Allow simple implementations.

- Have a small API surface.

- Be easy to use.

- Avoid adding any ambiguous conditions.

- Work well with the rest of the system and is as non-intrusive as possible.

Here's what I suggest as a revised standard. Note that the only change to the test interpreter is Section d. Section 3.X.X is the new part

Text interpretation (see 6.1.1360 EVALUATE and 6.1.2050 QUIT) shall
repeat the following steps until either the parse area is empty or an
ambiguous condition exists:

a) Skip leading spaces and parse a name (see 3.4.1);

b) Search the dictionary name space (see 3.4.2). If a definition name
   matching the string is found:

   1) if interpreting, perform the interpretation semantics of the
   definition (see 3.4.3.2), and continue at a).

   2) if compiling, perform the compilation semantics of the
   definition (see 3.4.3.3), and continue at a).

c) If a definition name matching the string is not found, attempt to
   convert the string to a number (see 3.4.1.3). If successful:

   1) if interpreting, place the number on the data stack, and
   continue at a);

   2) if compiling, compile code that when executed will place the
   number on the stack (see 6.1.1780 LITERAL), and continue at a);

d) Execute any recognizers in the dictionary name space (see 3.X.X) If
   any of them succeeds, continue at a).

e) If unsuccessful, an ambiguous condition exists (see 3.4.4).


3.X.X  User-defined recognizers

Do the following until no more recognizers are found or one of them
succeeds:

   1) Search the dictionary name space (see 3.4.2) for a definition
   whose name is "[RECOGNIZE]". This defintion is called the found
   recognizer. If none is found, the search is terminated.

   2) If found, perform the interpretation sematics of the found
   recognizer, passing it the string which has been parsed and a mode
   which is 1 if interpreting, 2 if compiling:

       [RECOGNIZE] ( a n mode -- flag)

   If the flag returned is true, terminate the search.

   [ Comment: The found recognizer performs some kind of action,
   perhaps compiling something into the dictionary, then returns true
   or false. ]

   If the flag returned is false, continue at 1), but searching the
   dictionary name space from the point of the definition which
   precedes the found recognizer.

[ Discussion:

   This isn't as inefficient as might seem at first glance because the
   result of searching the dictionary for words called [RECOGNIZE] can
   be cached. The dictionary only needs to be re-scanned when the
   search order is changed or a new definition of [RECOGNIZE] is added
   to the dictionary.

   Obviously, this removes any need for recognizer stacks. If a
   recognizer is defined by a library that has its own wordlist, the
   recognizer becomes visible to the interpreter (and therefore
   active) when the library's wordlist is added to the search
   order. This is, I believe, in most cases exactly what will be
   desired: the visibility of library-defined recognizers changes with
   the visibility of the words in the library.

   However, if it's necessary to define a recognizer which can be
   added or removed from view independently of any other definitions,
   it can be defined in a wordlist which contains ony one word, the
   recognizer. This wordlist can be added or removed from the search
   order as required.

   If it's necessary to have POSTPONE actions, another mode can be
   added, and the specification of POSTPONE amended to search for
   user-defined POSTPONE recognizers.

]

https://sourceforge.net/p/concurrentforth/code/ci/default/tree/recognizers/forth-interp.txt

The new part is section 3.X.X.

An example of such a recognizer for ` quoted characters is

: rectype-num ( n mode -- n | )   2 = if  postpone literal  then ;
    
: [recognize] ( a n mode - x...x t ) 
    -rot 3 = if
        dup c@ [char] ` = if
            dup 2 + c@ [char] ` = if
                1+ c@  swap rectype-num  true  exit
            then then
    then  drop 0 ;

For simplicity I haven't included POSTPONE actions or recognizer queries in this proposal, but they are trivial to add if we ever agree that they should be.

[And yes, the numbered modes should really be named constants, but that makes no difference to the idea.]

An implementation of this proposal (for SwiftForth) follows. I suspect it'd be similar in most Forths.

\ Execute all of the recognizers defined in wordlist WID until one
\ succeeds.
: (execute-recognizers) ( a n mode wid -- x...x xt | a n mode 0 )
    s" [recognize]" rot hashed {: a n mode link :}
    link
    begin @rel
        dup @ while
            dup l>name count s" [recognize]" compare(cs) 0= if
                to link
                a n mode link link> execute
                if ( success?)  link link> exit then
                link
            then
    repeat drop
    a n mode 0 ;

\ Execute all of the recognizers in the search order until one
\ succeeds.
: recognizers ( a n mode -- x...x xt | a n 0 )
    context #order @ cells over + swap do
        i @ (execute-recognizers)  
        ?dup if  unloop exit  then
    cell +loop
    drop 0 ;

[ I also had to patch the SwiftForth interpreter to call RECOGNIZERS, but I haven't included that here. ]

[r493] 2020-09-05 15:29:50 AndrewHaley replies:

proposal - An alternative to the RECOGNIZER proposal

As I've said more than once, I think the proposal is too complex and does too much. But I've been challenged to up with an alternative I think is better.

I'd like it to:

- Allow user-defined literals, dot parsers, etc.

- Make small changes to the standard.

- Allow simple implementations.

- Have a small API surface.

- Be easy to use.

- Avoid adding any ambiguous conditions.

- Work well with the rest of the system and is as non-intrusive as possible.

Here's what I suggest as a revised standard. Note that the only change to the test interpreter is Section d. Section 3.X.X is the new part

Text interpretation (see 6.1.1360 EVALUATE and 6.1.2050 QUIT) shall
repeat the following steps until either the parse area is empty or an
ambiguous condition exists:

a) Skip leading spaces and parse a name (see 3.4.1);

b) Search the dictionary name space (see 3.4.2). If a definition name
   matching the string is found:

   1) if interpreting, perform the interpretation semantics of the
   definition (see 3.4.3.2), and continue at a).

   2) if compiling, perform the compilation semantics of the
   definition (see 3.4.3.3), and continue at a).

c) If a definition name matching the string is not found, attempt to
   convert the string to a number (see 3.4.1.3). If successful:

   1) if interpreting, place the number on the data stack, and
   continue at a);

   2) if compiling, compile code that when executed will place the
   number on the stack (see 6.1.1780 LITERAL), and continue at a);

d) Execute any recognizers in the dictionary name space (see 3.X.X) If
   any of them succeeds, continue at a).

e) If unsuccessful, an ambiguous condition exists (see 3.4.4).


3.X.X  User-defined recognizers

Do the following until no more recognizers are found or one of them
succeeds:

   1) Search the dictionary name space (see 3.4.2) for a definition
   whose name is "[RECOGNIZE]". This defintion is called the found
   recognizer. If none is found, the search is terminated.

   2) If found, perform the interpretation sematics of the found
   recognizer, passing it the string which has been parsed and a mode
   which is 1 if interpreting, 2 if compiling:

       [RECOGNIZE] ( a n mode -- flag)

   If the flag returned is true, terminate the search.

   If the flag returned is false, continue at 1), but searching the
   dictionary name space from the point of the definition which
   precedes the found recognizer.

https://sourceforge.net/p/concurrentforth/code/ci/default/tree/recognizers/forth-interp.txt

The new part is section 3.X.X.

An example of such a recognizer for ` quoted characters is

: rectype-num ( n mode -- n | )   2 = if  postpone literal  then ;
    
: [recognize] ( a n mode - x...x t ) 
    -rot 3 = if
        dup c@ [char] ` = if
            dup 2 + c@ [char] ` = if
                1+ c@  swap rectype-num  true  exit
            then then
    then  drop 0 ;

For simplicity I haven't included POSTPONE actions or recognizer queries in this proposal, but they are trivial to add if we ever agree that they should be.

[And yes, the numbered modes should really be named constants, but that makes no difference to the idea.]

An implementation of this proposal (for SwiftForth) follows. I suspect it'd be similar in most Forths.

\ Execute all of the recognizers defined in wordlist WID until one
\ succeeds.
: (execute-recognizers) ( a n mode wid -- x...x xt | a n mode 0 )
    s" [recognize]" rot hashed {: a n mode link :}
    link
    begin @rel
        dup @ while
            dup l>name count s" [recognize]" compare(cs) 0= if
                to link
                a n mode link link> execute
                if ( success?)  link link> exit then
                link
            then
    repeat drop
    a n mode 0 ;

\ Execute all of the recognizers in the search order until one
\ succeeds.
: recognizers ( a n mode -- x...x xt | a n 0 )
    context #order @ cells over + swap do
        i @ (execute-recognizers)  
        ?dup if  unloop exit  then
    cell +loop
    drop 0 ;

[ I also had to patch the SwiftForth interpreter to call RECOGNIZERS, but I haven't included that here. ]

[r494] 2020-09-05 21:40:18 ruv replies:

proposal - Recognizer

TL;DR: token class from lexers is a wrong association.

The "token" term in lexical analysis

1. in computer science, the token term is used in various meanings.

2. In lexical analysis (and compilers theory), token is actually a shorthand for lexical token. (It's the same as in the Forth topic, the "definition" term is a shorthand for "Forth definition").

3. A lexical token is a tuple of a lexeme and the kind of this lexeme

(it's my rewording from Wikipedia, and also: "The lexeme's type combined with its value is what properly constitutes a token").

A lexical token usually doesn't bear any semantic information, it bears only lexical kind — it it an identifier, a number, string literal, or a particular key word.

Lexical tokens are not used in Forth since Forth doesn't distinguish lexemes by the different kinds.

4. In Forth, the token term is not used in the sense of lexical token. As we can conclude from the "execution token" and "name token" terms, in the Forth standard a token is just a kind of identifier, symbol, or something that represents something another (in general case; but numbers can represent themselves).

Lexical token class

The output of the scanner is often called "token class"

It's not quite correct. Output of the scanner (i.e. a lexer) is a sequence of lexical tokens.

Concerning "token class" — it is a shorthand for "lexical token class".

And actually, token class, token type, token category, token name (in the famous Dragon book) — all of them refers to the same thing, that I have called lexeme kind above.

Qualification of the tokens in Forth

We need to name the entity that qualifies a token.

We can call it "token class" (but without referring to "token class" in lexers). Initially I called this entity "token type". But then I realized that "token descriptor" is far better.

This entity can be created in run time, and it describes how to translate a token, — it is not an abstraction, it is an actual object (and the identifier of this object). So "create descriptor" shorthand sounds better than "create class" or "create type" (that look as more abstract things).

Only when we find names for the terms (in the human language, in the language of the standard ), we can find good names for words.

[r495] 2020-09-05 23:06:38 ruv replies:

proposal - An alternative to the RECOGNIZER proposal

1. "recognizer" term is used in another meaning and its definition is not provided.

By my terminology, a Forth definition ( i*x c-addr u mode -- j*x true | i*x false ) tries to translate the lexeme ( c-addr u ) according to the mode.

I like the idea of translators.

2. Ambiguous conditions

This approach introduces a very big and ugly ambiguous condition: if a program uses a word with the "[recognize]" name for its own purposes, it may crash. Forth didn't have reserved names. But now some particular names cannot be used by a program.

A similar approach is used in SP-Forth. There is a magic name "NOTFOUND".

3. Low reusing factor

What a program should use to recognize a number? How to create a user-defined Forth text interpreter? How to reuse the system's Forth text interpreter?

This approach doesn't make these things simple.

4. Too limited area of use cases.

E.g. a recognizer for 'X form cannot be defined to recognize X in any currently available form (e.g. wordlist::word form).