Forth 2012 Standard

1 Introduction

1.1 Purpose

The purpose of this standard is to promote the portability of Forth programs for use on a wide variety of computing systems, to facilitate the communication of programs, programming techniques, and ideas among Forth programmers, and to serve as a basis for the future evolution of the Forth language.

1.2 Scope

This standard specifies an interface between a Forth System and a Forth Program by defining the words provided by a Standard System.

1.2.1 Inclusions

This standard specifies:
  • the forms that a program written in the Forth language may take;
  • the rules for interpreting the meaning of a program and its data.

1.2.2 Exclusions

This standard does not specify:

  • the mechanism by which programs are transformed for use on computing systems;
  • the operations required for setup and control of the use of programs on computing systems;
  • the method of transcription of programs or their input or output data to or from a storage medium;
  • the program and Forth system behavior when the rules of this standard fail to establish an interpretation;
  • the size or complexity of a program and its data that will exceed the capacity of any specific computing system or the capability of a particular Forth system;
  • the physical properties of input/output records, files, and units;
  • the physical properties and implementation of storage.

1.3 Document organization

1.3.1 Word sets

This standard groups Forth words and capabilities into word sets under a name indicating some shared aspect, typically their common functional area. Each word set may have an extension, containing words that offer additional functionality. These words are not required in an implementation of the word set.

The "Core" word set, defined in sections 1 through 6, contains the required words and capabilities of a Standard System. The other word sets, defined in sections 7 through 18, are optional, making it possible to provide Standard Systems with tailored levels of functionality.

1.3.1.1 Text sections

Within each word set, section 1 contains introductory and explanatory material and section 2 introduces terms and notation used throughout the standard. There are no requirements in these sections.

Sections 3 and 4 contain the usage and documentation requirements, respectively, for Standard Systems and Programs, while section 5 specifies their labeling.

Sections x.1–x.6 of each word set have the same section numbering as sections 1–6 of the whole document to make it easy to relate the sections to each other. This may lead to gaps in section numbers if a particular section does not occur in a word set.

1.3.1.2 Glossary sections

Section 6 of each word set specifies the required behavior of the definitions in the word set and the extensions word set.

1.3.2 Annexes

The annexes do not contain any required material.

Annex A provides some of the rationale behind the committee's decisions in creating this standard, as well as implementation examples. It has the same section numbering as the body of the standard to make it easy to relate each requirements section to its rationale section.

Annex B is a short bibliography on Forth.

Annex C discusses the compatibility of this standard with earlier Forths.

Annex D presents some techniques for writing portable programs.

Annex H is an index of all Forth words defined in this standard.

1.4 Future directions

1.4.1 New technology

This standard adopts certain words and practices that are increasingly found in common practice. New words have also been adopted to ease creation of portable programs.

1.4.2 Obsolescent features

This standard adopts certain words and practices that cause some previously used words and practices to become obsolescent. Although retained here because of their widespread use, their use in new implementations or new programs is discouraged, as they may be withdrawn from future revisions of the standard.

This standard designates the following words as obsolescent:

15.6.2.1580 FORGET
6.2.2530 [COMPILE]
13.6.2.1795 LOCALS|

This standard designates the following practice as obsolescent:

  • Using ENVIRONMENT? to enquire whether a word set is present.


UlrichHoffmann Proposal 2020-02-24 09:57:56

Recognizer RfD rephrase 2020

Author: Ulrich Hoffmann
Contact: uho@xlerb.de
Version: 0.8 Date: 2020-02-24 Status: Published

Preamble

This text is a rephrasing of just section XY.2, XY.6, section XY.7 and parts of A.XY of the original recognizer RfD [1] by Matthias Trute that uses terminology and word names closer to that already present in Forth-94 and Forth-2012.

It is not intended to invalidate the subsequent RfDs B, C or D [2][3][4]. They reflect the ongoing discussion about Forth recognizers and should be considered valuable documentation of that discussion. This text, however, is intended to return the recognizer proposal to simplicity of concepts and terms, making it easier to understand and use as well as simpler to implement.

This text does not add any new functionality to the original proposal. It merely introduces different terms for the structures already existing in the original proposal. The only difference in functionality is the substitution of the defining word RECOGNIZER: of the original proposal by the word RECOGNIZER (note the missing : ) that - similar to the Forth-94 word WORDLIST - creates a recognizer information token and leaves it on the data stack.

Yes - this text has the potential of starting a bikeshedding discussion but as the recognizer concepts seem to be stable over the last couple of years it is about time to agree on appropriate names and notions.

The following table summarizes the different terms and names:

Term in original proposal    Term used here                       Comment
recognizer stack             recognizer-order                     similar to search-order
information token (rit)      recognizer information token (rit)   explicit and consistent
DO-RECOGNIZER                RECOGNIZE                            avoid hyphen in name
RECOGNIZER:                  RECOGNIZER                           similar to WORDLIST, no defining word
R:FAIL                       UNRECOGNIZED                         no : in name, better English
REC:xxx                      recognize-xxx                        no : in name, better English
R:xxx                        xxx-recognized                       no : in name, better English

Items to discuss

  1. Programs that use the word RECOGNIZE (e.g. user-defined text interpreters) most likely need to use the interpret/compile/postpone xts of the returned recognizer information token. For these programs to be portable among standard systems appropriate access words would need to be standardized. [3] and [4] propose such words. Without these access words standardizing the word RECOGNIZE is doubtful. Only standardizing the modified (internal) text interpreter behavior would be sufficient then.

  2. The word RECOGNIZER (and the corresponding defining word RECOGNIZER: of [1]) creates the opaque structure recognizer information token. As an alternative, recognizer information tokens could be defined - similar to addresses of counted strings (c-addr) - as special addresses, and the structure of memory at that address could be exposed. Recognizer information tokens could then be created by already existing standard words such as CREATE ALLOT ALLOCATE and would have a known layout, e.g. three xts in sequence: { INTERPRET-XT | COMPILE-XT | POSTPONE-XT }. The access words of 1. would not need to be standardized, as each standard program could access the xts using already existing standard words for memory access.

  3. The word RECOGNIZER (and the corresponding defining word RECOGNIZER: of [1]), despite its name, does not create a recognizer (i.e. a parsing word plus possibly several recognizer information tokens) but a single recognizer information token (a triple of interpret/compile/postpone xts characterized by a single-cell value). Another name might reflect this functionality better.

  4. Changes in the standard text interpreter (i.e. that it invokes the word RECOGNIZE internally) have implications for many other words apart from MARKER (e.g. ' ['] EVALUATE INCLUDE-FILE INCLUDED ...). Changes in their behaviour should be mentioned in the proposal. [2] proposes explicit changes for ' ['] MARKER, while [3] and [4] have a paragraph describing the implications generally and do not propose explicit changes to, e.g., MARKER.

  5. Recognizer information tokens (triple of interpret/compile/postpone xts characterized by a single-cell value) could be named more appropriately. [4] proposes a different name data type id that does not seem to be appropriate. Its general notion seems to mislead into the direction of Forth having a data type system.
    From a classical computer science view, recognizers act in the lexical analysis (scanner) phase of a compiler, operating on sequences of characters, detecting appropriate lexemes (character subsequences of the input stream) and converting them to tokens. Several lexemes might map to the same token (e.g. different sequences of digits map to the token NUM) along with so-called attributes (e.g. the value of the number). For this reason tokens are sometimes also called token classes, token types, or the kind of the token. These might be good alternative names instead of recognizer information token or data type id. Forth-94 and Forth-2012 use the term ID (as in wordlist-id or file-id) to define characterizing single-cell values, so following the xxx-id pattern would be consistent with existing standard terms (maybe recognizer-token-id?).
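The alternative in item 2 can be sketched as follows. This is only an illustration, assuming the three-cell layout { INTERPRET-XT | COMPILE-XT | POSTPONE-XT } mentioned there. The names demo-recognized, demo-interpret, demo-compile, demo-postpone, rit>interpret, rit>compile and rit>postpone are hypothetical and appear in none of the RfDs.

```forth
\ Sketch only: a rit as a transparent three-cell record built with CREATE.
\ All names in this block are hypothetical.
: demo-interpret ( n -- n )  ;                    \ interpreting: leave n on the stack
: demo-compile   ( n -- )    POSTPONE LITERAL ;   \ compiling: append n as a literal
: demo-postpone  ( n -- )    -21 THROW ;          \ placeholder: "unsupported operation"

CREATE demo-recognized            \ executing demo-recognized yields the rit (an address)
   ' demo-interpret ,             \ cell 0: INTERPRET-XT
   ' demo-compile ,               \ cell 1: COMPILE-XT
   ' demo-postpone ,              \ cell 2: POSTPONE-XT

\ Access uses existing standard memory words only.
: rit>interpret ( rit -- xt )  @ ;
: rit>compile   ( rit -- xt )  CELL+ @ ;
: rit>postpone  ( rit -- xt )  2 CELLS + @ ;
```

With such a layout a program could write `demo-recognized rit>compile EXECUTE` without any newly standardized access word; whether exposing the layout is desirable is exactly the open question of item 2.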

References

[1] Forth Recognizer -- Request For Discussion, Version 1, Matthias Trute, 2014-10-03, access at http://amforth.sourceforge.net/pr/Recognizer-rfc.pdf

[2] Forth Recognizer -- Request For Discussion, Version 2, Matthias Trute, 2015-09-20, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-B.pdf

[3] Forth Recognizer -- Request For Discussion, Version 3, Matthias Trute, 2016-09-04, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-C.pdf

[4] Forth Recognizer -- Request For Discussion, Version 4, Matthias Trute, 2018-08-02, access at http://amforth.sourceforge.net/pr/Recognizer-rfc-D.pdf


Proposal

....

XY.2 Additional terms and notations

Recognizer Information Token: An implementation-dependent single-cell value that identifies the data type and a method table to perform the data processing of the interpreter. A naming convention suggests that the names end with -recognized. Recognizer Information Tokens are abbreviated rit in stack comments.

Recognizer: A combination of a text parsing word that returns recognizer information tokens together with parsed data if successful. The text parsing word is assumed to run in cooperation with SOURCE and >IN. A naming convention suggests that the names start with recognize-.

...

XY.6 Glossary

XY.6.1 Recognizer words

RECOGNIZE ( addr len -- i*x rit | UNRECOGNIZED ) RECOGNIZER

Apply the recognizers in the recognizer-order to the string at "addr/len" one after the other. Terminate the iteration if either a recognizer returns a recognizer information token rit that is different from UNRECOGNIZED or the recognizer-order is exhausted. If the recognizer-order is exhausted, return UNRECOGNIZED; otherwise return rit.

"i*x" is the result of the parsing word. It may be on other locations than the data stack. In this case the stack diagram should be read accordingly.

It is an ambiguous condition if the recognizer-order is empty.


GET-RECOGNIZERS ( -- rec-n .. rec-1 n ) RECOGNIZER

Return the execution tokens rec-1 .. rec-n of the parsing words in the recognizer-order. rec-1 identifies the recognizer that is called first and rec-n the execution token of the word that is called last.

The recognizer-order is unaffected.


MARKER ( "name" -- ) RECOGNIZER

Extend MARKER to include the current recognizer-order in the state preservation.


UNRECOGNIZED ( -- UNRECOGNIZED ) RECOGNIZER

A constant cell sized recognizer information token with two uses: first it is used to deliver the information that a specific recognizer could not deal with the string passed to it. Second it is a predefined recognizer information token whose elements are used when no recognizer from the recognizer-order could handle the passed string. These methods provide the system error actions.

The actual numeric value is system dependent and has no predictable value.


RECOGNIZER ( XT-INTERPRET XT-COMPILE XT-POSTPONE -- rit ) RECOGNIZER

Create a recognizer information token rit with the three execution tokens XT-INTERPRET XT-COMPILE XT-POSTPONE. The implementation is system dependent.

The words for XT-INTERPRET, XT-COMPILE and XT-POSTPONE are called with the parsed data that the associated parsing word of the recognizer returned. The information token itself is consumed by the interpreter.


SET-RECOGNIZERS ( rec-n .. rec-1 n -- ) RECOGNIZER

Set the recognizer-order to the recognizers identified by the execution tokens of their parsing words rec-n .. rec-1. rec-1 will be the parsing word of the recognizer that is called first, rec-n will be the last one.

It is an ambiguous condition, if n is not a positive number.

XY.7 Reference Implementation

\ create a simple 3 element structure
\ rit           : XT-INTERPRET
\ rit CELL+     : XT-COMPILE
\ rit 2 CELLS + : XT-POSTPONE
: RECOGNIZER ( XT-INTERPRET XT-COMPILE XT-POSTPONE -- rit )
    HERE >R SWAP ROT , , , R> ;

\ system failure recognizer
: notfound ( i*x -- )  -13 THROW ;

' notfound  ' notfound  ' notfound RECOGNIZER CONSTANT UNRECOGNIZED

\ contains the recognizer-order
\ first cell is the current number of recognizers.
10 CELLS BUFFER: recognizer-order
0 recognizer-order !

: SET-RECOGNIZERS ( rec-n .. rec-1 n -- )
    DUP recognizer-order !
    BEGIN
      DUP
    WHILE
      DUP CELLS recognizer-order +
      ROT SWAP ! 1-
    REPEAT DROP 
;

: GET-RECOGNIZERS ( -- rec-n .. rec-1 n )
    recognizer-order @ recognizer-order
    BEGIN
      CELL+ OVER
    WHILE
      DUP @ ROT 1- ROT
    REPEAT 2DROP
    recognizer-order @
;

: RECOGNIZE ( addr len -- i*x rit | UNRECOGNIZED )
    recognizer-order @
    BEGIN
      DUP
    WHILE
      DUP CELLS recognizer-order + @
      2OVER 2>R SWAP 1- >R
      EXECUTE DUP UNRECOGNIZED <> IF R> DROP 2R> 2DROP EXIT THEN DROP
      R> 2R> ROT
    REPEAT
    DROP 2DROP
    UNRECOGNIZED
;

POSTPONE

POSTPONE is outside the Forth interpreter:

: POSTPONE ( "<spaces>name" -- )
   BL WORD COUNT
   RECOGNIZE
   2 CELLS + @ ( post ) \ get the XT-POSTPONE from recognizer
   EXECUTE
; IMMEDIATE

...

A.XY Informal Annex

A.XY.1 Forth Text Interpreter

The Forth text interpreter turns into a generic tool that is capable of dealing with any data type. It maintains STATE and calls the data processing methods according to it.

INTERPRETER
: PARSE-NAME ( -- addr u ) BL WORD COUNT ;

: INTERPRET ( i*x -- j*x )
    BEGIN
      PARSE-NAME ?DUP 0= IF DROP EXIT THEN \ no more words? (empty name: drop addr and leave)
      RECOGNIZE
      STATE @ IF  CELL+ @  ( comp ) ELSE @ ( interp ) THEN \ get the right XT
      EXECUTE \ do the action
      ?STACK \ simple housekeeping
    AGAIN 
;

A.XY.2 Example Recognizers

Word recognizer
\ find-name is close to FIND. amforth specific.
256 BUFFER: find-name-buf

: place ( c-addr1 u c-addr2 -- ) \ store the string as a counted string at c-addr2
   2DUP C! CHAR+ SWAP MOVE ;

: find-name ( addr len -- xt +/-1 | 0 )
   find-name-buf place
   find-name-buf
   FIND DUP 0= IF NIP THEN ;

: immediate? ( flags -- true|false ) 0> ;

\ Define word recognizer

\ INTERPRET
:NONAME ( i*x XT flags -- j*y )
  DROP EXECUTE ;

\ COMPILE
:NONAME ( XT flags -- )
  immediate?
  IF EXECUTE ELSE COMPILE, THEN ; \ immediate words run now, all others are compiled

\ POSTPONE
:NONAME ( XT flags -- )
  immediate?
  IF COMPILE, ELSE POSTPONE LITERAL POSTPONE COMPILE, THEN ;

RECOGNIZER CONSTANT word-recognized

\ parsing word for word recognizer
: recognize-word ( addr len -- XT flags rit | UNRECOGNIZED )
   find-name ( addr len -- XT flags | 0 )
   ?DUP IF word-recognized ELSE UNRECOGNIZED THEN ;

\ prepend the word recognizer to the recognizer-order
GET-RECOGNIZERS ' recognize-word SWAP 1+ SET-RECOGNIZERS
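A second recognizer can be built along the same lines. The following is a minimal sketch, not part of the proposal, of a recognizer for unsigned single-cell numbers in the current BASE. The names literal, num-recognized and recognize-num are hypothetical. The guarded stand-ins for RECOGNIZER and UNRECOGNIZED merely repeat XY.7 so that the sketch loads on its own, and signs, double-cell numbers and base prefixes are deliberately ignored.

```forth
\ Sketch only: a recognizer for unsigned single-cell numbers in the current BASE.
\ Stand-ins for two XY.7 words; skipped when XY.7 has already been loaded.
[UNDEFINED] UNRECOGNIZED [IF]
: RECOGNIZER ( xt-interpret xt-compile xt-postpone -- rit )
   HERE >R SWAP ROT , , , R> ;
: notfound ( i*x -- )  -13 THROW ;
' notfound ' notfound ' notfound RECOGNIZER CONSTANT UNRECOGNIZED
[THEN]

: literal, ( n -- )  POSTPONE LITERAL ;   \ hypothetical helper: compile n as a literal

:NONAME ( n -- n )  ;                             \ INTERPRET: leave n on the stack
:NONAME ( n -- )    literal, ;                    \ COMPILE: append n as a literal
:NONAME ( n -- )    literal, POSTPONE literal, ;  \ POSTPONE: compile code that compiles n
RECOGNIZER CONSTANT num-recognized

: recognize-num ( addr len -- n num-recognized | UNRECOGNIZED )
   DUP 0= IF 2DROP UNRECOGNIZED EXIT THEN   \ empty string: nothing to convert
   0 0 2SWAP >NUMBER                        ( ud addr' len' )
   NIP IF 2DROP UNRECOGNIZED EXIT THEN      \ unconverted characters remain
   DROP num-recognized ;                    \ keep the low cell of ud
```

With the reference implementation loaded, such a recognizer could be added to the recognizer-order in the same way as recognize-word. Its position relative to the word recognizer decides whether a name made only of digits is treated as a word or as a number.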

end of document

ruv 2020-03-01 12:55:28

I think this work on rephrasing, improving the terminology and wording, and even reforming the concepts is very important.

My thoughts regarding the terminology are the following.

3. The word "RECOGNIZER".

The most glaring conflict in the terminology seems to lie between the words RECOGNIZER and GET-RECOGNIZERS. The former returns rit, the latter returns rec. Such a conflict is inadmissible in the Standard. Another term (and another name for the word) should be found instead of "recognizer".

Anyway, keeping some semantic similarity to the WORDLIST word for this word looks like a good choice (if any).

5. A name for triple of interpret/compile/postpone xts

This item is closely connected to the above one (3).

It should be taken into account that we already have execution token xt (that identifies execution semantics) and name token nt (that identifies a named definition). I.e., the specification applies the term "token" to an attribute itself (a value only), without the corresponding information about its type (or class, or kind). Hence, the information about the corresponding type should be called "token type" (an identifier of a token type).

Under the hood, this token type identifier should be associated with handlers: how to execute (interpret) the corresponding token, how to compile the corresponding token, etc.

See also my approach in comp.lang.forth post in 2018 (news:pngvcc$pta$1@gioia.aioe.org, copy).

4. Changes in the specifications of other words

Yes, the specification for MARKER should be updated.

But there's no need to mention changes in the behavior of the words that:

ruv 2020-03-01 17:02:35

The items (1) and (2) are more about API than about terminology.

2. rit structure accessors

I think that suggesting programs use N CELL+ @ to access an xt is a bad choice. Even for the mentioned counted strings we have the COUNT word, i.e. an accessor. Therefore, even for a transparent structure, if we provide access, we should provide access words. (But I think we don't need access to the xts, see 1.ii below.)

1. interpret/compile/postpone xts

"Programs [...] most likely need to use the interpret/compile/postpone xts of the returned recognizer information token. [...] Without these access words standardizing the word RECOGNIZE is doubtful."

We have the following issues with this:

i. Nobody provides a use-case

Nobody provides a use-case for the scenario when a program needs access to these xts.

OTOH, we have one important principle: a user-defined text interpreter should be implementable in a standard way (in this case, without re-implementing recognizers). NB: the POSTPONE-action is not needed for this text interpreter.

ii. Better to avoid accessors

Perhaps another way is better. Instead of using the corresponding xts, use the words that do the corresponding actions:

INTERPRET-TOKEN ( i*x token{k*x} token-type -- j*x )
COMPILE-TOKEN ( i*x token{k*x} token-type -- j*x )
POSTPONE-TOKEN ( token{k*x} token-type -- )

I.e., in place of

( ... rit ) _R>COMP EXECUTE

you just do

( ... rit ) COMPILE-TOKEN

Rationale: in most cases a user/program just needs to perform these actions. Getting an xt and then executing it adds an extra step without any benefit.
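As an illustration only, the three action words could be layered on the xt triple of the reference implementation in XY.7. The sketch assumes that a rit is the address of three consecutive cells holding the interpret, compile and postpone xts, and it treats "token-type" and rit as the same thing. The demo-* names are hypothetical.

```forth
\ Sketch only, assuming a rit is the address of three consecutive cells
\ { interpret-xt | compile-xt | postpone-xt } as in XY.7.
: INTERPRET-TOKEN ( i*x token{k*x} rit -- j*x )  @ EXECUTE ;
: COMPILE-TOKEN   ( i*x token{k*x} rit -- j*x )  CELL+ @ EXECUTE ;
: POSTPONE-TOKEN  ( token{k*x} rit -- )          2 CELLS + @ EXECUTE ;

\ Hypothetical rit whose actions only report which one was chosen.
: demo-i ( -- 1 )  1 ;
: demo-c ( -- 2 )  2 ;
: demo-p ( -- 3 )  3 ;
CREATE demo-rit  ' demo-i ,  ' demo-c ,  ' demo-p ,

demo-rit COMPILE-TOKEN .   \ prints 2
```

A system with an opaque rit could provide the same three words without exposing any xt, which is the point of the suggestion.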


Other issues regarding the API and examples.

6. Naming in A.XY.2

find-name was proposed to be a standard word that returns a name token nt.

It is better to use another name for your word ( addr len -- xt +/-1 | 0 ).

I can suggest

find-word-in ( c-addr u  wid -- c-addr u 0 | xt immediate-flag true )
find-word ( c-addr u -- c-addr u 0 | xt immediate-flag true )

Rationale: 1) the FIND word returns c-addr on failure; 2) when implementing a text interpreter, on failure you need ( c-addr u ) to convert it into a number; 3) in some cases, always returning the same number of result items is an advantage for optimization.
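A word with the suggested find-word stack effect can be sketched on top of the standard word FIND. This is an assumption-laden illustration rather than anyone's actual implementation: find-word-buf is a hypothetical scratch buffer, names longer than 255 characters are simply reported as not found, and any STATE dependency of FIND carries over unchanged.

```forth
\ Sketch only: find-word built on the standard word FIND.
\ find-word-buf is a hypothetical scratch buffer for the counted string FIND needs.
256 BUFFER: find-word-buf

: find-word ( c-addr u -- c-addr u 0 | xt immediate-flag true )
   DUP 255 U> IF 0 EXIT THEN       \ too long for a counted string: not found
   2DUP find-word-buf 2DUP C!      \ store the count byte
   CHAR+ SWAP MOVE                 \ copy the characters after it
   find-word-buf FIND              ( c-addr u c-addr2 0 | c-addr u xt 1 | c-addr u xt -1 )
   DUP 0= IF NIP EXIT THEN         \ not found: drop c-addr2, keep c-addr u 0
   1 = >R NIP NIP R> TRUE ;        \ found: xt immediate-flag true
```

On failure the original string is still on the stack, so a text interpreter can pass it straight on to number conversion, as described in point 2 of the rationale.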

7. Action of postponing isn't essential

i. Why does a user (a program) need to use the POSTPONE-action?

The only known use case is to implement the ]] ... [[ construct. But this construct, when implemented via the POSTPONE-action, has a set of known flaws: it doesn't follow the copy-pastability design principle and, as a result, it doesn't handle immediate parsing words in a convenient way (including comments and [IF] ... [THEN]).

Actually, a user doesn't need to use the POSTPONE-action; he just needs to postpone fragments of code! Therefore, it is better to provide something like a c{ ... }c construct (see my c-state PoC) that provides full copy-pastability.

ii. Why does a user (a program) have to specify POSTPONE-action for a new token-type?

The only reason is to make a Forth system aware of how to apply POSTPONE to a user-defined literal. But a user doesn't need to apply POSTPONE to the literals if a Forth system provides a way to postpone any fragments of code.

Well, perhaps a Forth system cannot postpone a user-defined literal (or even a parsing word) as part of a fragment of code if a user doesn't provide a POSTPONE-action? But that is wrong. Since any COMPILE-action is defined via standard words (and words defined via standard words), a Forth system is able to postpone any token for which a COMPILE-action is defined (see the same c-state PoC).

Yes, it is not easy to implement, but it is very convenient to use!

ruv 2020-03-05 16:56:05

Another open question

8. Dependency on STATE

Obviously, the results of RECOGNIZE may depend on search order and BASE. Also, a user-defined recognizer may depend on user-defined states.

But what about STATE-dependency for the initial recognizers? May the results of RECOGNIZE depend on STATE?

E.g. recognize-word from the example in A.XY.2 is based on FIND, and hence it may depend on STATE. Hence, in some cases the result is not allowed to be performed in a different STATE (see also my proposal for FIND clarification).

AntonErtl 2020-05-28 08:14:52

"Yes - this text has the potential of starting a bikeshedding discussion but as the recognizer concepts seem to be stable over the last couple of years it is about time to agree on appropriate names and notions."

It seems to me that the seeming stability is treacherous. Stephen Pelc wants to go back to an earlier version of the proposal, and Alex McDonald wants to revise it. We have had renamings already, and as a result, it is not as easy to compare versions as it could be.

However, we have also had a new concept that got the name of an old concept that it replaced: the postpone action was replaced by a time-shifting action (from which the postpone action can be built), but the time-shifting action was still called "postpone action". I am not sure if changing the name now would be helpful, or if it's too late.

StephenPelc 2020-05-30 12:11:07

I apologise for my delay in responding to Ulli's document.

Overall I think that it's a really good first step, and Ruv's comments are also good.

I do not want to go back to an older proposal in particular, I want a proposal that ordinary mortals can understand. I just want clarity.

Naming

I don't much care whether the recogniser triple is called a rid or a rit. The more neutral term seems to be rid, but rit is now in use, so let's keep to it. Either can be pronounced clearly in discussions.

In normal Forth usage the word that lays down the implementation-dependent data would be called RECOGNIZER, (with a comma). It should return a rit. What's the point of having an identifier if we don't use it? If we use this terminology, then the obvious way to refer to return values is RIT-NUM, RIT-FLOAT and so on.

Accessors

Do these ever get used outside the internals of other words? If not, a standard team has no business prescribing these. How many other implementations of a rit exist (in the wild) apart from the xt triple? Ruv's point about a use-case is well taken here.

The POSTPONE action xt is needed for two reasons:

  1. POSTPONE needs it
  2. Not all parsers are for literals, e.g. OOP parsers.

We cannot predict how recognisers will be used, so attempting to automate the POSTPONE actions is doomed to failure. OOP is not a hand-waving prediction: VFX's CIAO and ClassVFX packages both use recognisers.

STATE dependency

Having RECOGNIZE be dependent on STATE is horrible.
