Proposal: An alternative to the RECOGNIZER proposal

Informal

This page is dedicated to discussing this specific proposal

ContributeContributions

AndrewHaleyavatar of AndrewHaley An alternative to the RECOGNIZER proposalProposal2020-09-05 15:09:39

As I've said more than once, I think the proposal is too complex and does too much. But I've been challenged to up with an alternative I think is better.

I'd like it to:

- Allow user-defined literals, dot parsers, etc.

- Make small changes to the standard.

- Allow simple implementations.

- Have a small API surface.

- Be easy to use.

- Avoid adding any ambiguous conditions.

- Work well with the rest of the system and is as non-intrusive as possible.

With regard to that last one, I would like recognizers to be associated with wordlists, so that when a recognizer is defined it is in the current wordlist, and when that wordlist is visible so is the recognizer. Recognizer ordering therefore should follow the search order and not use a separate ordering and visibility mechanism. When a recognizer is forgotten, it should simply disappear. This is ideal for libraries that define recognizers and live in their own wordlist: to use a library add its wordlist to the search order, and the recognizers come too.

A rough draft of my proposed change is at

https://sourceforge.net/p/concurrentforth/code/ci/default/tree/recognizers/forth-interp.txt

The new part is section 3.X.X.

An example of such a recognizer for ` quoted characters is

: rectype-num ( n mode -- n | )   2 = if  postpone literal  then ;

: [recognize] ( a n mode - x...x t ) 
    -rot 3 = if
        dup c@ [char] ` = if
            dup 2 + c@ [char] ` = if
                1+ c@  swap rectype-num  true  exit
            then then
    then  drop 0 ;

For simplicity I haven't included POSTPONE actions or recognizer queries in this proposal, but they are trivial to add if we ever agree that they should be.

[And yes, the numbered modes should really be named constants, but that makes no difference to the idea.]

An implementation of this proposal (for SwiftForth) follows. I suspect it'd be similar in most Forths.

\ Execute all of the recognizers defined in wordlist WID until one
\ succeeds.
: (execute-recognizers) ( a n mode wid -- x...x xt | a n mode 0 )
    s" [recognize]" rot hashed {: a n mode link :}
    link
    begin @rel
        dup @ while
            dup l>name count s" [recognize]" compare(cs) 0= if
                to link
                a n mode link link> execute
                if ( success?)  link link> exit then
                link
            then
    repeat drop
    a n mode 0 ;

\ Execute all of the recognizers in the search order until one
\ succeeds.
: recognizers ( a n mode -- x...x xt | a n 0 )
    context #order @ cells over + swap do
        i @ (execute-recognizers)  
        ?dup if  unloop exit  then
    cell +loop
    drop 0 ;

[ I also had to patch the SwiftForth interpreter to call RECOGNIZERS, but I haven't included that here. ]

AndrewHaleyavatar of AndrewHaleyNew Version

Show differences

As I've said more than once, I think the proposal is too complex and does too much. But I've been challenged to up with an alternative I think is better.

I'd like it to:

- Allow user-defined literals, dot parsers, etc.

- Make small changes to the standard.

- Allow simple implementations.

- Have a small API surface.

- Be easy to use.

- Avoid adding any ambiguous conditions.

- Work well with the rest of the system and is as non-intrusive as possible.

With regard to that last one, I would like recognizers to be associated with wordlists, so that when a recognizer is defined it is in the current wordlist, and when that wordlist is visible so is the recognizer. Recognizer ordering therefore should follow the search order and not use a separate ordering and visibility mechanism. When a recognizer is forgotten, it should simply disappear. This is ideal for libraries that define recognizers and live in their own wordlist: to use a library add its wordlist to the search order, and the recognizers come too.

Here's what I suggest as a revised standard. Note that the only change to the test interpreter is Section d. Section 3.X.X is the new part

Text interpretation (see 6.1.1360 EVALUATE and 6.1.2050 QUIT) shall
repeat the following steps until either the parse area is empty or an
ambiguous condition exists:

a) Skip leading spaces and parse a name (see 3.4.1);

b) Search the dictionary name space (see 3.4.2). If a definition name
   matching the string is found:

   1) if interpreting, perform the interpretation semantics of the
   definition (see 3.4.3.2), and continue at a).

   2) if compiling, perform the compilation semantics of the
   definition (see 3.4.3.3), and continue at a).

c) If a definition name matching the string is not found, attempt to
   convert the string to a number (see 3.4.1.3). If successful:

   1) if interpreting, place the number on the data stack, and
   continue at a);

   2) if compiling, compile code that when executed will place the
   number on the stack (see 6.1.1780 LITERAL), and continue at a);

d) Execute any recognizers in the dictionary name space (see 3.X.X) If
   any of them succeeds, continue at a).

e) If unsuccessful, an ambiguous condition exists (see 3.4.4).


3.X.X  User-defined recognizers

Do the following until no more recognizers are found or one of them
succeeds:

   1) Search the dictionary name space (see 3.4.2) for a definition
   whose name is "[RECOGNIZE]". This defintion is called the found
   recognizer. If none is found, the search is terminated.

   2) If found, perform the interpretation sematics of the found
   recognizer, passing it the string which has been parsed and a mode
   which is 1 if interpreting, 2 if compiling:

       [RECOGNIZE] ( a n mode -- flag)

   If the flag returned is true, terminate the search.

   [ Comment: The found recognizer performs some kind of action,
   perhaps compiling something into the dictionary, then returns true
   or false. ]

   If the flag returned is false, continue at 1), but searching the
   dictionary name space from the point of the definition which
   precedes the found recognizer.

[ Discussion:

   This isn't as inefficient as might seem at first glance because the
   result of searching the dictionary for words called [RECOGNIZE] can
   be cached. The dictionary only needs to be re-scanned when the
   search order is changed or a new definition of [RECOGNIZE] is added
   to the dictionary.

   Obviously, this removes any need for recognizer stacks. If a
   recognizer is defined by a library that has its own wordlist, the
   recognizer becomes visible to the interpreter (and therefore
   active) when the library's wordlist is added to the search
   order. This is, I believe, in most cases exactly what will be
   desired: the visibility of library-defined recognizers changes with
   the visibility of the words in the library.

   However, if it's necessary to define a recognizer which can be
   added or removed from view independently of any other definitions,
   it can be defined in a wordlist which contains ony one word, the
   recognizer. This wordlist can be added or removed from the search
   order as required.

   If it's necessary to have POSTPONE actions, another mode can be
   added, and the specification of POSTPONE amended to search for
   user-defined POSTPONE recognizers.

]

https://sourceforge.net/p/concurrentforth/code/ci/default/tree/recognizers/forth-interp.txt

The new part is section 3.X.X.

An example of such a recognizer for ` quoted characters is

: rectype-num ( n mode -- n | )   2 = if  postpone literal  then ;

: [recognize] ( a n mode - x...x t ) 
    -rot 3 = if
        dup c@ [char] ` = if
            dup 2 + c@ [char] ` = if
                1+ c@  swap rectype-num  true  exit
            then then
    then  drop 0 ;

For simplicity I haven't included POSTPONE actions or recognizer queries in this proposal, but they are trivial to add if we ever agree that they should be.

[And yes, the numbered modes should really be named constants, but that makes no difference to the idea.]

An implementation of this proposal (for SwiftForth) follows. I suspect it'd be similar in most Forths.

\ Execute all of the recognizers defined in wordlist WID until one
\ succeeds.
: (execute-recognizers) ( a n mode wid -- x...x xt | a n mode 0 )
    s" [recognize]" rot hashed {: a n mode link :}
    link
    begin @rel
        dup @ while
            dup l>name count s" [recognize]" compare(cs) 0= if
                to link
                a n mode link link> execute
                if ( success?)  link link> exit then
                link
            then
    repeat drop
    a n mode 0 ;

\ Execute all of the recognizers in the search order until one
\ succeeds.
: recognizers ( a n mode -- x...x xt | a n 0 )
    context #order @ cells over + swap do
        i @ (execute-recognizers)  
        ?dup if  unloop exit  then
    cell +loop
    drop 0 ;

[ I also had to patch the SwiftForth interpreter to call RECOGNIZERS, but I haven't included that here. ]

AndrewHaleyavatar of AndrewHaleyNew Version

Show differences

As I've said more than once, I think the proposal is too complex and does too much. But I've been challenged to up with an alternative I think is better.

I'd like it to:

- Allow user-defined literals, dot parsers, etc.

- Make small changes to the standard.

- Allow simple implementations.

- Have a small API surface.

- Be easy to use.

- Avoid adding any ambiguous conditions.

- Work well with the rest of the system and is as non-intrusive as possible.

With regard to that last one, I would like recognizers to be associated with wordlists, so that when a recognizer is defined it is in the current wordlist, and when that wordlist is visible so is the recognizer. Recognizer ordering therefore should follow the search order and not use a separate ordering and visibility mechanism. When a recognizer is forgotten, it should simply disappear. This is ideal for libraries that define recognizers and live in their own wordlist: to use a library add its wordlist to the search order, and the recognizers come too.

Here's what I suggest as a revised standard. Note that the only change to the test interpreter is Section d. Section 3.X.X is the new part

Text interpretation (see 6.1.1360 EVALUATE and 6.1.2050 QUIT) shall
repeat the following steps until either the parse area is empty or an
ambiguous condition exists:

a) Skip leading spaces and parse a name (see 3.4.1);

b) Search the dictionary name space (see 3.4.2). If a definition name
   matching the string is found:

   1) if interpreting, perform the interpretation semantics of the
   definition (see 3.4.3.2), and continue at a).

   2) if compiling, perform the compilation semantics of the
   definition (see 3.4.3.3), and continue at a).

c) If a definition name matching the string is not found, attempt to
   convert the string to a number (see 3.4.1.3). If successful:

   1) if interpreting, place the number on the data stack, and
   continue at a);

   2) if compiling, compile code that when executed will place the
   number on the stack (see 6.1.1780 LITERAL), and continue at a);

d) Execute any recognizers in the dictionary name space (see 3.X.X) If
   any of them succeeds, continue at a).

e) If unsuccessful, an ambiguous condition exists (see 3.4.4).


3.X.X  User-defined recognizers

Do the following until no more recognizers are found or one of them
succeeds:

   1) Search the dictionary name space (see 3.4.2) for a definition
   whose name is "[RECOGNIZE]". This defintion is called the found
   recognizer. If none is found, the search is terminated.

   2) If found, perform the interpretation sematics of the found
   recognizer, passing it the string which has been parsed and a mode
   which is 1 if interpreting, 2 if compiling:

       [RECOGNIZE] ( a n mode -- flag)

   If the flag returned is true, terminate the search.

   If the flag returned is false, continue at 1), but searching the
   dictionary name space from the point of the definition which
   precedes the found recognizer.

https://sourceforge.net/p/concurrentforth/code/ci/default/tree/recognizers/forth-interp.txt

The new part is section 3.X.X.

An example of such a recognizer for ` quoted characters is

: rectype-num ( n mode -- n | )   2 = if  postpone literal  then ;

: [recognize] ( a n mode - x...x t ) 
    -rot 3 = if
        dup c@ [char] ` = if
            dup 2 + c@ [char] ` = if
                1+ c@  swap rectype-num  true  exit
            then then
    then  drop 0 ;

For simplicity I haven't included POSTPONE actions or recognizer queries in this proposal, but they are trivial to add if we ever agree that they should be.

[And yes, the numbered modes should really be named constants, but that makes no difference to the idea.]

An implementation of this proposal (for SwiftForth) follows. I suspect it'd be similar in most Forths.

\ Execute all of the recognizers defined in wordlist WID until one
\ succeeds.
: (execute-recognizers) ( a n mode wid -- x...x xt | a n mode 0 )
    s" [recognize]" rot hashed {: a n mode link :}
    link
    begin @rel
        dup @ while
            dup l>name count s" [recognize]" compare(cs) 0= if
                to link
                a n mode link link> execute
                if ( success?)  link link> exit then
                link
            then
    repeat drop
    a n mode 0 ;

\ Execute all of the recognizers in the search order until one
\ succeeds.
: recognizers ( a n mode -- x...x xt | a n 0 )
    context #order @ cells over + swap do
        i @ (execute-recognizers)  
        ?dup if  unloop exit  then
    cell +loop
    drop 0 ;

[ I also had to patch the SwiftForth interpreter to call RECOGNIZERS, but I haven't included that here. ]

ruvavatar of ruv

1. "recognizer" term is used in another meaning and its definition is not provided.

By my terminology, a Forth definition ( i*x c-addr u mode -- j*x true | i*x false ) tries to translate the lexeme ( c-addr u ) according to the mode.

I like the idea of translators.

2. Ambiguous conditions

This approach introduces a very big and ugly ambiguous condition: if a program uses a word with the "[recognize]" name for its own purposes, it may crash. Forth didn't have reserved names. But now some particular names cannot be used by a program.

A similar approach is used in SP-Forth. There is a magic name "NOTFOUND".

3. Low reusing factor

What a program should use to recognize a number? How to create a user-defined Forth text interpreter? How to reuse the system's Forth text interpreter?

This approach doesn't make these things simple.

4. Too limited area of use cases.

E.g. a recognizer for 'X form cannot be defined to recognize X in any currently available form (e.g. wordlist::word form).

AndrewHaleyavatar of AndrewHaley

  1. "recognizer" term is used in another meaning and its definition is not provided.

By my terminology, a Forth definition ( ix c-addr u mode -- jx true | i*x false ) tries to translate the lexeme ( c-addr u ) according to the mode.

I think it's clear enough what I'm proposing. Are you objecting to the use of the common word "recognize"?

I like the idea of translators.

  1. Ambiguous conditions

This approach introduces a very big and ugly ambiguous condition: if a program uses a word with the "[recognize]" name for its own purposes, it may crash.

It's not ambiguous at all: once this feature is enabled, "[recognize]" may not be used as a name for any other purpose. Of course the actual string "[recognize]" isn't fixed; it could be anything. It could even be comething unprintable and untypeable, entered into the dictionary by some means.

Forth didn't have reserved names. But now some particular names cannot be used by a program.

Good point.

A similar approach is used in SP-Forth. There is a magic name "NOTFOUND".

  1. Low reusing factor

What a program should use to recognize a number? How to create a user-defined Forth text interpreter? How to reuse the system's Forth text interpreter?

I believe numbers are recognized by >NUMBER.

This section doesn't describe how to create a user-defined Forth text interpreter. That's to be decided.

Reusing the system's text interpreter is performed by EVALUATE.

This approach doesn't make these things simple.

In my opinion, prposals should do one thing, and do them well, in the simplest way possible. The object of this proposal is a simple, flexible way to enable user-defined lierals and e.g. dot parsers.

  1. Too limited area of use cases.

E.g. a recognizer for 'X form cannot be defined to recognize X in any currently available form (e.g. wordlist::word form).

Can you explain more? Are you talking about nested parsers?

AndrewHaleyavatar of AndrewHaley

  1. "recognizer" term is used in another meaning and its definition is not provided.

By my terminology, a Forth definition ( ix c-addr u mode -- jx true | i*x false ) tries to translate the lexeme ( c-addr u ) according to the mode.

I think it's clear enough what I'm proposing. Are you objecting to the use of the common word "recognize"?

I like the idea of translators.

  1. Ambiguous conditions

This approach introduces a very big and ugly ambiguous condition: if a program uses a word with the "[recognize]" name for its own purposes, it may crash.

It's not ambiguous at all: once this feature is enabled, if you define "[recognize]" it's going to be called by the interpreter. Of course the actual string "[recognize]" isn't fixed; it could be anything. It could even be something unprintable, entered into the dictionary by some means. It could even be a bit like an IMMEDIATE bit rather than a special name. The point is to put recognizers into the dictionary.

Forth didn't have reserved names. But now some particular names cannot be used by a program.

Good point.

A similar approach is used in SP-Forth. There is a magic name "NOTFOUND".

  1. Low reusing factor

What a program should use to recognize a number? How to create a user-defined Forth text interpreter? How to reuse the system's Forth text interpreter?

I believe numbers are recognized by >NUMBER.

This section doesn't describe how to create a user-defined Forth text interpreter. That's to be decided.

Reusing the system's text interpreter is performed by EVALUATE.

This approach doesn't make these things simple.

In my opinion, prposals should do one thing, and do them well, in the simplest way possible. The object of this proposal is a simple, flexible way to enable user-defined lierals and e.g. dot parsers.

  1. Too limited area of use cases.

E.g. a recognizer for 'X form cannot be defined to recognize X in any currently available form (e.g. wordlist::word form).

This seems like a request for nested parsers and that brings with it the need for other techniques, e.g. recursive descent, precedence grammars, and so on. What does 'a::b.c mean, anyway? Is it ('a)::(b.c) or '(a::b).c or something else? There's no way to know without specifying an operator-precedence grammar, and it's the job of the programmer to do that. This proposal is a very simple way for a Forth user to add to the text interpreter; what they add is up to them.

There's nothing to prevent any user from writing a word to recognize e.g. a context-free grammar and then calling it from [recognize]. But again, if you want to write a parser for Language X in Forth you can do that then pass the string to your parser. This proposal isn't supposed to do that.

JennyBrienavatar of JennyBrien

I like the idea of linking recognizers to wordlists. Generally speaking, if you need to recognize , say FP numbers you also need to recognize the word to handle them. Often you are building and using wordlists on the fly, but forlibrary code Filename=Word Set=Wordlist seems useful.

However: by analogy with:

b) Search the dictionary name space (see 3.4.2). If a definition name matching the string is found:

1) if interpreting, perform the interpretation semantics of the definition (see 3.4.3.2), and continue at a).

2) if compiling, perform the compilation semantics of the definition (see 3.4.3.3), and continue at a).

d) might read:

d) Search the dictionary name space (see 3.4. x..x.) for associated recognizers. If successful:

1) if interpreting, perform the interpretation semantics of the returned rec-type and continue at a). 2) If compiling, perform the compilation semantics of the returned rec-type and continue at a)

3.x.x. Finding Associated Recognizers

Recognizers are associated with the CURRENT wordlist by:

RECOGNIZED ( xt -- ) 
Note: This can be done by having RECOGNIZED add an entry to the current wordlist that performs xt. If so, its name should contain a space, or otherwise be unfindable by SEARCH-WORDLIST. 

They are executed in reverse order of definition for each wordlist in the search order until one returns a rec-type other than REC-FAIL or there are no more to be found

 FIND-RECOGNIZER  ( a n -- rec-type true | rec-fail false )  

ruvavatar of ruv

Re point 4, see the related discussions in comp.lang.forth, news:pot8vf$811$1@gioia.aioe.org (and an implementation).

Combination with tick was discussed in news:rd9sh6$r8v$1@dont-email.me).

What does 'a::b.c mean, anyway?

It means '(a::b.c) (without parentheses), where a::b.c should be resolvable into an ordinary Forth word (i.e., that has default interpretation semantics).

Is it ('a)::(b.c) or '(a::b).c or something else?

('a)::(b.c) does not make sense since 'a is resolved into a single-cell number. '(a::b).c does not make sense for the same reason.

Also, it should be a consequence of the specification for tick prefix (the corresponding recognizer or resolver), i.e. that it resolves the rest part into an ordinary Forth word.

ruvavatar of ruv

This proposal isn't supposed to do that.

The Recognizer API (actually, all the versions and rewordings), minimalistic core API for recognizers, my Resolver API — all of them technically support the mentioned nesting, and they support implementation of independent recognizers for 'X and for X::Y, and that 'X::Y will work automatically.

AndrewHaleyavatar of AndrewHaley

Firstly, there is a need for user-defined literals and some other kinds of prefix notation. Anyone who needs anything more exotic (or powerful) and wants it to be standardized had better provide evidence that it's needed for Forth programs. A good design will have everything you need and nothing more.

Secondly, 'a::b would just work. Any system supporting a::b as wordlist::word would have to redefine FIND to break the tokens apart: a recognizer for '-prefixed words would call FIND, which would find the word.

AndrewHaleyavatar of AndrewHaley

Re Jenny's point. It's necessary to define some mechanism by which "performing the interpretation semantics" of some rec-type might be performed. It seems to me more appropriate to specify exactly how that gets done here: it gets done by the called recognizer word. The "semantics" are whatever the recognizer does.

ruvavatar of ruv

Are you objecting to the use of the common word "recognize"?

Though the common word "recognize" is used in a non usual meaning. Your "recognizer" does not just recognize a lexeme, but also performs interpretation or compilation semantics for the lexeme. It's confusing that performing semantics is a part of recognizing by your interpretation.

Firstly, there is a need for user-defined literals and some other kinds of prefix notation

What is a literal?

By the first glance,'X is a literal, a::b is a literal, 'a::b is a literal too — the run-time semantics for all of them is just to put a number (an xt) into the stack.

Any system supporting a::b as wordlist::word would have to redefine FIND

Do you mean that it should be done in a non standard way (i.e., not over the API you are proposing)?

An issue of your API is that we cannot define 'X format in the general form: '<any-literal-that-is-mapped-to-single-xt>. Ditto we cannot define wordlist::word format in the general form <any-literal-that-is-mapped-to-single-xt>::name.

Re Jenny's point.

Jenny is right substantially (since "rectype" is not used in this proposal). The idea is that the found "[RECOGNIZE]" word should perform interpretation semantics for the lexeme if interpreting, and compilation semantics if compiling.

"If found, perform the interpretation sematics of the found recognizer"

The Forth text interpreter only performs interpretation semantics if interpreting, and compilation semantics if compiling. So this phrase in the specification makes things too confusing. Better to say: "perform the execution semantics".

ruvavatar of ruv

Correction: By the first glance,'X is a literal, 'a::b is a literal too — the run-time semantics for all of them is just to put a number (an xt) into the stack.

Re a::b — it's run-time semantics may be other.

Reply New Version