Digest #292 2025-03-22

Contributions

[375] 2025-03-21 13:49:08 antonio wrote:

example - First test for D2*

Maybe it was already discussed, but to me the first test should be: T{ 0. D2* -> 0. }T or maybe T{ 0. D2* -> 0 0 }T because the test is the same on both sides.
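
A small sketch of why the two spellings are interchangeable: a double-cell number such as 0. occupies two cells on the stack, so in a standard system all three tests below should pass. It assumes a T{ ... }T test harness (for example ttester.fs) has already been loaded; nothing else here goes beyond what the contribution itself states.

```forth
\ Assumes the T{ -> }T harness (e.g. ttester.fs) is loaded.
T{ 0. -> 0 0 }T        \ a double-cell zero is two zero cells
T{ 0. D2* -> 0. }T     \ first spelling suggested above
T{ 0. D2* -> 0 0 }T    \ second spelling, same stack picture
```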

Replies

[r1417] 2025-01-23 08:00:15 AntonErtl replies:

requestClarification - where definition is compiled?

The standard generally leaves it to the system where it puts the compiled code. It might be in the dictionary, but it could also be elsewhere. Or it could be in the dictionary and elsewhere (e.g., Gforth puts threaded code in the dictionary, and native code elsewhere). The standard gives very few guarantees about this, so it also talks very little about it: There is "code space" in "2.1 Definition of Terms", and there is "3.3.2 Code Space" which does not give any guarantees.

Your best bet at reclaiming code space is to use FORGET or MARKER, but there is no guarantee that these words actually reclaim code space. And given the complexity of implementation arising from that, I am thinking about changing Gforth such that it does not reclaim the native code when MARKER is used.

I think this answers the question, so I am closing it. If there is anything unclear yet, write a reply and reopen it.


[r1418] 2025-01-23 08:09:47 AntonErtl replies:

requestClarification - `NAME>STRING` result is transient

I think the idea was to allow systems that store definition names in a representation other than that returned by NAME>STRING. One example would be the fig-Forth representation of names, which sets the high bit of the last byte (fig-Forth only supports names in ASCII).
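
To make the fig-Forth case concrete, here is a rough sketch of how such a system might have to build the result in a scratch buffer, which is exactly what makes the returned string transient. The names name-buf and fig-name>string are hypothetical, and the input is assumed to be the stored name bytes (at most 32 characters) with bit 7 set on the last one; real header layouts differ in further details.

```forth
\ Hypothetical sketch, not taken from any particular system.
create name-buf 32 chars allot   \ scratch buffer, reused on every call

: fig-name>string ( c-addr u -- c-addr' u )
  \ c-addr u: stored name whose last byte has its high bit set
  dup >r  name-buf swap cmove              \ copy the stored bytes
  r@ if                                    \ non-empty name?
    name-buf r@ 1- chars +                 \ address of the last character
    dup c@ 127 and swap c!                 \ clear bit 7 in the copy
  then
  name-buf r> ;                            \ valid only until the next call
```

Because every call overwrites name-buf, a caller that needs the name for longer has to copy it, which is what the "transient" wording in the specification permits.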

Now that we have had NAME>STRING in the standard for a decade, we can look at the systems that actually implement this word. If they all return a name that lives as long as the definition, we could enhance this word by giving that guarantee. But who will examine the systems and make the proposal?


[r1419] 2025-01-23 08:29:12 ruv replies:

requestClarification - where definition is compiled?

how it is implemented, specially where the definition list of the word created using :NONAME is compiled.

The standard intentionally does not specify many options, which fall into implementation-defined options (which shall be documented) and implementation-dependent options (which might be undocumented).

Simple implementations typically reserve a big memory region for the dictionary and use data space for code space too. Then, in direct and indirect threaded code, the words compile, and lit, are defined simply as:

: compile, ( xt -- ) , ;
: lit, ( x -- ) ['] lit compile, , ;

And

:noname 2 * ; 

is equivalent to

:noname [ 2 lit,  ' * compile, ] ;

is there any way to free the memory?

This is possible using marker:

marker restore-dict

: foo 123 . ;
foo \ prints "123"

restore-dict

foo \ error: not found
restore-dict \ error: not found

In some Forth systems you can create many dictionaries and free them independently of each other.


[r1420] 2025-01-23 09:57:57 ruv replies:

requestClarification - `NAME>STRING` result is transient

I have checked. There are about 22 Forth systems on GitHub that provide name>string implemented in Forth (see the search results).

Among these systems there is only one system, namely solo-forth (description: "Standard Forth system for ZX Spectrum 128 and compatible computers, with disk drives"), in which the word name>string returns a string in a transient buffer. This system has to copy the resulting string into the transient buffer because the header space and the data space are located in different address spaces.


[r1421] 2025-01-23 14:13:32 ruv replies:

requestClarification - `NAME>STRING` result is transient

So, the only advantage of a transient result is that it allows saving memory in some cases. And it only makes sense when saving 10-100 KiB of memory (say, 8% of compiled code size) matters.

An example of an approach that can benefit is the use of a trie data structure (prefix tree) to implement efficient searching of word lists. This data structure does not need to store entire strings. Therefore, to save memory, a transient string for a word name can be constructed each time it is needed.


[r1422] 2025-02-13 13:14:50 ruv replies:

proposal - minimalistic core API for recognizers

Re interpreting

(similar arguments apply to the word compiling too)

From the proposal's "Problem" section:

The Forth interpreter is stateful, but the API should avoid the problems of the STATE variable. In particular, an implementation without STATE should be possible, and there is only one place where the stateful dispatch is necessary.

We should consider that Forth words may do stateful dispatch by themselves and they may rely on the value of STATE. Usually, the Forth system itself cannot determine whether a user-defined word performs stateful dispatch. Therefore, it is essential for the Forth system to ensure the STATE variable is correctly set to reflect the formal state of the Forth text interpreter when executing a user-defined word.

The assumption that the value of STATE is irrelevant when xt-int is executed—because xt-int does not perform stateful dispatch itself—is flawed. This is because, when xt-int is executed, it may invoke a user-defined word that performs stateful dispatch.

The suggested word INTERPRETING ( j*x xt -- k*x ) is confusing and useless, because it just executes xt-int (obtained from xt) and does not ensure that the value of STATE is 0 before a user-defined word is invoked by xt-int.

I suggested the word execute-interpreting that applies to any xt. When it applies to a token translator, the corresponding interpretation semantics are performed. And this correctly works even when a user-defined word that performs stateful dispatch is invoked by the token translator.
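
For readers unfamiliar with the term, here is a minimal illustration of a user-defined word that performs stateful dispatch. The names .mode and show are made up for this example; the point is only that the behaviour of show depends on the value of STATE at the moment it runs, regardless of who invokes it.

```forth
\ Hypothetical names; standard words only.
: .mode ( -- )  state @ if  ." compile"  else  ." interpret"  then ;
: show  ( -- )  .mode ; immediate

show             \ prints "interpret"
: foo  show ;    \ prints "compile" while foo is being compiled
```

If an interpretation action ends up executing a word like show while STATE is still non-zero, it takes the "compile" branch, which is the failure mode described above.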


From the rationale to TRANSLATE::

The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by EXECUTE to keep the API small. You can not simply set STATE, use EXECUTE and afterwards restore STATE to perform interpretation or compilation semantics, because words can change STATE, so you need the words INTERPRETING and COMPILING defined below.

The provided specification does not guarantee that interpreting and compiling solve this problem, and in the reference implementation they do not solve the problem.

The words execute-interpreting and execute-compiling solve the problem, and they do not need to know xt-int or xt-comp from a token translator.

Re "postponing" state

  1. What is the rationale for formally introducing a "postponing" state? If you need it only for ]] ... [[, then it's better to extract them into a separate proposal, and also explain why this approach is better than implementing ]] as a parsing word.

  2. Why do you need to specify that ]] changes STATE to a third value if user-defined words do not see this value and cannot analyze this value?

Re set-state and get-state

It's unclear how these words can be used.

It seems that the word SET-STATE is underspecified. Also, its name is confusing, because it formally is not allowed to change STATE (at the moment).

Re translators and translate:

translator: named subtype of xt, and executes with the following stack effect: name ( j*x i*x -- k*x )

Why do you require a translator to be named? I use anonymous token translators (defined as quotations) and find them very useful.

From the spec to TRANSLATE::

Create a translator word under the name "name". This word is the only standard way to define a general purpose translator.

It's necessary to define what a "general purpose translator" is, and it should be clear how a translator that is not a general purpose translator can be defined in a standard way.

Minimize core

To reduce the scope of discussion, we should minimize the core API.

So, it is better to put the words ]], [[, RECOGNIZER-SEQUENCE, etc, to separate proposals.

But a recognizer for local variables should be added, because they are already standardized and a Forth system that supports local variables must recognize them.


[r1423] 2025-02-13 14:17:51 GeraldWodni replies:

proposal - minimalistic core API for recognizers

@ruv: I agree, that it would be nicer to have sequences etc. not in the proposal. However: They are a good example of how recognizers can be used in practice, which makes me think they should stay inside to give some guidance to rec:newbies. We can still ask Bernd for further modifications, but I think we should do so grouped together after the meeting, to avoid unnecessary edits.


[r1424] 2025-02-14 11:21:35 BerndPaysan replies:

proposal - minimalistic core API for recognizers

I don't see a problem to separate the proposal into several smaller ones, especially taking optional parts out that belong together.

The postpone mode can indeed either be implemented loop-style (i.e. like PolyForth's ]), or with a state; it shouldn't be necessary to specify the details.

If you have STATE-smart words in your system or user-defined such words, the only way to get the correct interpretation and compilation semantics involves having STATE as expected, you can't just call INTERPRETING or COMPILING on a translator or use some table index mechanism as in the Trute proposal to call the right slot.

If you don't have such things and have a Forth system where STATE-free replacement mechanisms are used for dual-semantics words (e.g. Gforth or VFX), and you don't define STATE-smart words yourself, you can actually use that API. That's why I think such an API can be actually standardized before we make STATE obsolescent and have standardized replacements available.


[r1425] 2025-02-14 15:18:17 BerndPaysan replies:

proposal - minimalistic core API for recognizers

can actually be standardized

I mean can't. We need to phase out STATE and define possible replacements before we can have a STATEless API.


[r1426] 2025-02-15 09:57:48 AntonErtl replies:

proposal - minimalistic core API for recognizers

At the online meeting on 2025-02-13 I was asked to present a subproposal for factoring the state-dependent component out of TRANSLATE:.

There are many possible ways to skin this cat, e.g., the one in Matthias Trute's proposal, or the way that the present proposal used up to v4. Here I present a way that requires relatively few changes to the current version of this proposal.

XY.3.1 Definition of terms

Replace the definition of translator with:

translator: a cell-sized opaque token that represents how a recognized lexeme can be interpreted, compiled, or postponed. A translator usually needs additional data about the recognized lexeme that is deeper in the stacks.

Replace uses of translator-xt in ?NOTFOUND with translator, and likewise for other words that, in [r1412], consume or push the xt of a translator.

XY.6 Glossary

TRANSLATOR:

Replace the definition of TRANSLATE: with

TRANSLATOR: ( xt-int xt-comp xt-post "<spaces>name" -- )

Skip leading space delimiters. Parse name delimited by a space. Create a definition for name with the execution semantics defined below.

name is referred to as translator.

name Execution: ( -- translator )

translator represents a translator with interpretation action xt-int, compilation action xt-comp, and postpone action xt-post.

Modified words:

INTERPRETING ( i*x translator -- k*x )

Execute xt-int of translator.

COMPILING ( j*x translator -- l*x )

Execute xt-comp of translator.

POSTPONING ( j*x translator -- )

Execute xt-post of translator.

STATE-TRANSLATING

Add:

STATE-TRANSLATING ( i*x translator -- j*x )

Remove translator from the stack.

If the system has a postpone state, and is currently in postpone state, execute xt-post of translator.

Otherwise, if the system is in interpretation state, execute xt-int of translator.

Otherwise, execute xt-comp of translator.

Discussion

The benefit of having each translator word return a translator token is that one does not need to tick the translator words in all the recognizers. A slight improvement in writability and readability with no downside (compared to [r1412]).

The benefit of factoring out state-translating is that the state dependence can be confined to the place(s) that actually need state dependence: The standard Forth text interpreter (and user-defined text interpreters that are intended to work similarly). It does not infect all translators.

Typical use

The standard interpreter loop:

: interpret ( i*x -- j*x )
  BEGIN  parse-name dup  WHILE  forth-recognize ?found state-translating  REPEAT
  2drop ;

Implementation of POSTPONE is the same as in the existing proposal:

: postpone ( "name" -- )
  parse-name forth-recognize ?found postponing ; immediate

The implementation of ' becomes slightly shorter (no need to tick translate-nt):

: ' ( "name" -- xt )
  parse-name forth-recognize ?found
  translate-nt <> #-32 and throw
  name>interpret ;

Now for interpreter loops that do not use STATE.

First, the polyForth division of interpreter and compiler:

: parse-name-refill ( -- c-addr u )
  begin
    parse-name dup 0= while
      2drop refill 0= if
        0 0 exit then
  repeat ;

: ] ( i*x -- j*x )
  BEGIN
    parse-name-refill dup while
      2dup "[" str= 0= while
        forth-recognize ?found compiling
  REPEAT THEN  \ THEN resolves the first WHILE
  2drop ;

: pf-interpret ( i*x -- j*x )
  BEGIN  parse-name-refill dup  WHILE  forth-recognize ?found interpreting  REPEAT
  2drop ;

And here's one for colorforth-bw:

: cfbw-interpret ( i*x -- j*x )
  begin
    parse-name dup  while
      over c@ >r 1 /string forth-recognize ?found r> case
        '[' of interpreting endof
        '_' of compiling endof
        ']' of postponing endof
        -13 throw
      endcase
  repeat
  2drop ;

The problem with these interpreters is that there is no standardized or proposed way to plug this interpret into the existing infrastructure (e.g., included), so the benefit of being able to write this is limited to one line (in case of colorforth-bw) or the rest of the file in case of the polyForth-style interpreter.

But the recognizer proposal allows replacing forth-recognize, and this allows us to plug colorforth-bw into the text interpreter until further notice. I presented a way to do it with an earlier version of this proposal in [r1397]; here's a way of doing it with [r1412] modified by this sub-proposal:

defer recognizer1 action-of forth-recognize is recognizer1

: translator-bw1 ( i*x translator c -- j*x )
  case
    '[' of interpreting endof
    '_' of compiling endof
    ']' of postponing endof
    -13 throw
  endcase ;

' translator-bw1 dup dup translator: translator-bw

: recognize-colorforth-bw ( c-addr u -- translator )
  dup 0= if 2drop 0 exit then
  over c@ >r 1 /string recognizer1
  r> over if translator-bw else drop then ;

' recognize-colorforth-bw is forth-recognize

Reference implementation:

A straightforward implementation is:

: translator: ( xt-int xt-comp xt-post "<spaces>name" -- )
  create , , , ;

: state-translating ( i*x translator -- j*x )
  state @ if compiling else interpreting then ;

This does not cover a potential postpone state; if a system has a postpone state and can enter the standard text interpreter in this state, then the implementation of state-translating should be extended accordingly.

Of course, this implementation of state-translating is far too inefficient for some tastes, so here's a more clever one:

: state-translating ( i*x translator -- j*x )
  2 state @ 0<> + cells + @ execute ;
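
Both versions rely on INTERPRETING, COMPILING and POSTPONING and on the field order produced by create , , , in translator: above (xt-post first, then xt-comp, then xt-int). The following sketch spells those accessors out and shows one way state-translating might be extended for a postpone state. It assumes, purely for illustration, that postpone state is signalled by STATE containing -2; a system that represents postpone state differently would need a different test.

```forth
\ Accessors for a translator built by  create , , ,
\ layout: xt-post at offset 0, xt-comp at 1 cell, xt-int at 2 cells
: interpreting ( i*x translator -- j*x )  2 cells + @ execute ;
: compiling    ( i*x translator -- j*x )  cell+ @ execute ;
: postponing   ( i*x translator -- j*x )  @ execute ;

\ Assumption: STATE = -2 means postpone state (not guaranteed by the standard)
: state-translating ( i*x translator -- j*x )
  state @ dup 0= if  drop interpreting exit  then
  -2 = if  postponing exit  then
  compiling ;
```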

For even more efficiency we can redefine `]` and `[`:

defer state-translating

: [ ( -- )
  [ ( old implementation ) ['] interpreting is state-translating ; immediate

[ \ initialize state-translating

: ] ( -- )
  ] ( old implementation ) ['] compiling is state-translating ;

If there is a word that sets the postpone state, that word should also set state-translating accordingly.

There are also changes involving words that push literal translator tokens. In [r1412] the translator word needs to be ticked, in this subproposal you do not do that. E.g., rec-nt now looks as follows:

: rec-nt ( addr u -- nt nt-translator | 0 )
  forth-wordlist find-name-in dup IF  translate-nt  THEN ;

[r1427] 2025-02-16 18:45:17 AntonErtl replies:

proposal - minimalistic core API for recognizers

STATE-dependence

[r1412] still contains a defining word for state-dependent translators (and none for translators without this mistake), which is unacceptable to me. I have suggested an improvement in [r1426].

Dividing the proposal?

There have been some discussions about dividing the proposal. I don't think that that's a good idea for the discussion, but in usage I see the division into the following hierarchy of use cases, which require different words; the later use cases usually require also implementing the words for the earlier use cases:

  1. Programs that use the default recognizers. For them we need to specify a standard recognizer sequence (including how to deal with locals): REC-NT REC-NUM REC-FLOAT (if present) corresponds to Forth-2012. I expect that systems that have REC-STRING and REC-TICK to put these into their recognizer sequence, too. How do we document in the program documentation which recognizers are needed? Probably we need to extend the program documentation requirements (until now the recognition of doubles, floats and locals has been coupled with documenting the double, float and local wordset, respectively, but for REC-STRING and REC-TICK that's probably not the way to go).

    The new POSTPONE is also at that usage level.

  2. Programs that change which of the existing recognizers are used and in what order. For them we need the names of the existing recognizers (not sure about the translators), FORTH-RECOGNIZE, SET-RECOGNIZER-SEQUENCE, GET-RECOGNIZER-SEQUENCE, .RECOGNIZERS (not yet proposed) and maybe RECOGNIZER-SEQUENCE:. If all the standardized recognizers are in FORTH-RECOGNIZE by default, there will probably not be much of this kind of usage, except maybe to put REC-FLOAT in front of REC-NUM (to recognize "1." as float; REC-FLOAT would have to be defined in more detail for that to work).

  3. Programs that define new recognizers that use existing translators. This usage needs the names of the translators.

  4. Programs that define new translators. This usage needs TRANSLATE: (or TRANSLATOR:).

  5. Programs that define text interpreters and programming tools that have to deal with recognizers (such as a recognizer-aware postpone). These programs need INTERPRETING, COMPILING, POSTPONING or STATE-TRANSLATING.

A system with recognizers is a program of all these types, so all these words will be present in every such system (with the exception of some recognizers and related translators), so there is little point in making most of these words optional (except rec-float, rec-string, rec-tick and translators used only by those recognizers). But it is still a good idea to present the words divided by these usages. We usually present words in alphabetical order in the document. Should we continue this tradition for these words? If so, the division of words above should probably be documented in the rationale.

For word counters

Given that usage 5 above is rare in user programs, word counters may prefer to replace the four words INTERPRETING, COMPILING, POSTPONING or STATE-TRANSLATING with one word

TRANSLATING ( i*x translator n -- j*x )

where

  • 0 TRANSLATING is equivalent to INTERPRETING

  • -1 TRANSLATING is equivalent to COMPILING

  • -2 TRANSLATING is equivalent to POSTPONING

  • STATE @ 0<> TRANSLATING is equivalent to the reference implementation of STATE-TRANSLATING
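
A rough sketch of how TRANSLATING could look, assuming the create , , , layout of the straightforward translator: implementation in [r1426] (xt-post at offset 0, xt-comp at 1 cell, xt-int at 2 cells). The demo translator and its printing actions are hypothetical and exist only to show which slot each value of n selects.

```forth
\ Layout assumption copied from the straightforward reference implementation
: translator: ( xt-int xt-comp xt-post "name" -- )  create , , , ;

\ n = 0 selects xt-int, -1 selects xt-comp, -2 selects xt-post
: translating ( i*x translator n -- j*x )  2 + cells + @ execute ;

\ Hypothetical demo translator whose actions only report which one ran
:noname ." int " ;  :noname ." comp " ;  :noname ." post " ;
translator: demo-translator

demo-translator  0 translating   \ prints "int "
demo-translator -1 translating   \ prints "comp "
demo-translator -2 translating   \ prints "post "
```

With this layout, STATE @ 0<> TRANSLATING selects between the first two cases only, matching the last bullet above.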

A simple Forth system has only one use of POSTPONING (in POSTPONE) and one use of STATE-TRANSLATING (in INTERPRET), so defining 4 words for the purpose may seem excessive. And replacing them with TRANSLATING saves a tiny bit of source code and memory.

OTOH, there is no standard way to use TRANSLATING for STATE-TRANSLATING in the general case, where the system has a postpone state, because there is no standard way to determine postpone state. Moreover, the specification of TRANSLATING is not so nice (that's why I left it out in the above), and the code using it will be less readable.

Gerund

It's not clear to me why the gerund form is used (INTERPRETING etc.), although I kept with it for my suggestions (for consistency). I would use an imperative form; and because "interpret", "compile" and "postpone" are already taken, maybe something like TRANSLATOR>INTERPRET or somesuch, which would parallel NAME>INTERPRET. However, the latter pushes an xt, the former executes it, so either we let TRANSLATOR>INTERPRET also produce an xt, or use a slightly different naming scheme, such as TRANSLATOR*INTERPRET.

GET-STATE SET-STATE

It's unclear what get-state and set-state do, and their names suggest a stack effect ( -- f ) and ( f -- ).

The reference implementation does not make that any clearer; in particular, the reference implementation of set-state does not make any sense at all, and I would not know why anybody would want to use get-state.

[IF] parts

This makes the proposal hard to understand and discuss. Take a decision (possibly after asking around, but I doubt that anyone but you and maybe ruv has a proper basis for an opinion), put it in the proposal, and give a rationale for the decision in a section Discussion.

Side effects

I do not see a good way to specify in the normative part of the document that a recognizer must not have a side effect. The proposal mentions "supposed to" and "promise". The normative part says what specific words do (or there is an ambiguous condition). It seems to me that the discussion about side effects should go into the non-normative rationale. It's clear enough what happens when somebody uses a word that invokes a recognizer, and that recognizer has a side effect; no need for an ambiguous condition.

NOTFOUND

I have no preference here, but I remember that Matthias Trute presented a case for notfound, and that sounded convincing. Why do his arguments no longer hold (or did they not hold in the first place)?

FORTH-RECOGNIZE, deferred or getter and setter?

I see no benefits to having a getter and setter here. Deferred words are fine.

Presentation

The "Solution" chapter is not comprehensible except to those deep into the discussion: It is full of unexplained terms, such as "data parsing", "token type". And "translator" is not comprehensible to anybody who comes fresh to the proposal, and even to those who have seen some earlier recognizer proposals.

The second part of "Solution" should be a separate section "Transition for some implementors/users of Matthias Trute's proposal".

More NOTFOUND stuff

The proposal defines ?FOUND, ?NOTFOUND, and NOTFOUND only for NOTFOUND=0. This looks like a bug to me.

The stack effect of ?FOUND and other words: We do not have "never" in the standard. What's that supposed to mean?

XY.3.1 Translator

"named subtype"? What's that? The rest of the wording is woefully inadequate. A careful specification would reveal the complexity that you get with state-dependent translators.

?NOTFOUND

?NOTFOUND has a horrible stack effect. This word is not shown in any typical use examples? Is it needed? If it is needed, maybe the stack effects of the other words can be changed to make it unnecessary; although, admittedly, when I worked on combining recognizers, I did not find a solution with a nice stack flow (and I have tried). Hmm, maybe with a variant of case with a specialized variant of of?

POSTPONE

"if the exception wordset is not present". The exception wordset has been a required part of Forth200x for several years.

SET-RECOGNIZER-SEQUENCE

As specified, the sequence will always fit. Can the sequence fail to fit? If so, specify what happens.

REC-NUM

Should this be the all-singing, all-dancing variant (including doubles, number prefixes and '<char>')? Given existing practice and the legacy code base, yes. OTOH, with recognizers it seems a conceptually attractive option to have the rec-num be a decomposable sequence consisting of the various cases. But given nestable recognizer sequences, that's always an option for the future.

SCAN-TRANSLATE-STRING

This should follow C conventions for newlines like the rest of the string syntax, i.e., escape newlines with \. If other conventions are desired (e.g. what may or may not be JSON syntax), that would be for another recognizer and another translator.

The specification should be clear about what it does: "REFILL can be used to read in more lines" is neither here nor there.

TRANSLATE-STRING ?SCAN-STRING

What are these words good for? REC-STRING apparently does not need them.

[[

A word without interpretation nor compilation semantics?

Should we specify whether there is a postpone state, or alternatively that ]] has its own text interpreter loop? There are ways to distinguish these two kinds of implementation; does it matter? Maybe if you want to EVALUATE something in postpone state or somesuch.

]] and [[ should probably go into a separate proposal.

STATE

Changing the specification of state such that there is at least one non-zero value that does not mean "compilation state" is not an extension of the current specification of state, but a change. However, existing practice of systems which use -2 as postpone state suggests that this does not break existing code in practice. That's probably because so little existing code actually uses postpone state. With wider use of postpone state, some breakage may actually turn up.

The safe option would be to represent postpone state (if we have it at all) in a way other than through a value of STATE. E.g., have another variable POSTPONE-STATE: if it's false, then STATE determines the state; if it's true, the system is in postpone state.
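
A minimal sketch of that alternative, assuming the words INTERPRETING, COMPILING and POSTPONING from [r1426] are available. The variable name POSTPONE-STATE is taken from the paragraph above; how and where a system would set it (for instance in ]] and [[) is left open here.

```forth
\ Assumes INTERPRETING, COMPILING and POSTPONING are already defined.
variable postpone-state   0 postpone-state !   \ false: STATE decides

: state-translating ( i*x translator -- j*x )
  postpone-state @ if  postponing exit  then   \ postpone state wins
  state @ if  compiling  else  interpreting  then ;
```

With this representation STATE keeps its existing meaning, so code that tests STATE for compilation state is unaffected.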

In any case, if we put ]] in another proposal, that's where we should have this discussion.


[r1428] 2025-02-16 23:03:46 BerndPaysan replies:

proposal - minimalistic core API for recognizers

Multiline strings

I don't think C is setting a good example. Nobody took C's syntax for proper multiline strings, not even C++. C is still an important legacy language, but COBOL also is in the top 20. You don't want to have multiline strings like COBOL.

  • C++11 got raw strings, and gcc supports them even in C. The syntax has a R"( as start, and a )" as end (with the option of adding more letters to disambiguate the string ending). Raw strings don't translate backslash+characters, which is often what you want, because the multiline string is actually some other programming language, and the editor is fine inserting all the characters you want there without escapes. Note that you need some way to disambiguate the string ending in a raw string, as you can't escape ".
  • Rust, Visual Basic (≥14), R, Ruby, and PHP strings are multiline by default (inserting newlines where the string has line breaks)
  • JavaScript (using template literals) and Go use ` (backtick) for multiline (raw) strings
  • C# uses @" to start a multiline string
  • SQL uses ' (single quote) for multiline strings
  • Java 15 has text blocks (with """ as start and end)
  • Python uses either """ or ''' for multiline strings

Nobody makes proper multiline strings like C. Really nobody. Not even recent C compilers, they follow C++. I'm now at item 20 of Tiobe index, and most languages nowadays have multiline strings one way or the other. Getting Emacs to recognize multiline strings was easy: Just remove the \n from the end of string pattern. Emacs likes multiline strings. JSON-variants with multiline strings are likely from developers that use Ruby or PHP. You have to deal with this sort of stuff.

The most popular option seems to be multiline strings by default, when legacy (e.g. through a C-like syntax) isn't a problem. As we are adding a new syntax for string literals, we don't need to care about backwards compatibility. One popular feature is to remove blanks from auto-indented strings, as editors indent these strings. Strictly speaking, if we support non-raw multi-line strings, we could even parse C strings, if a \ as last character is defined as “don't add a newline here” (instead of “unfinished escape sequence”).


[r1429] 2025-02-17 00:37:20 ruv replies:

proposal - minimalistic core API for recognizers

Bernd writes:

If you have STATE-smart words in your system or user-defined such words, the only way to get the correct interpretation and compilation semantics involves having STATE as expected, you can't just call INTERPRETING or COMPILING on a translator or use some table index mechanism as in the Trute proposal to call the right slot.

Right.

If you don't have such things and have a Forth system where STATE-free replacement mechanisms are used for dual-semantics words (e.g. Gforth or VFX), and you don't define STATE-smart words yourself, you can actually use that API.

In Forth, you almost always have such things, because you have EVALUATE and INCLUDE-FILE, which depend on STATE.

INCLUDE-FILE translates a file, EVALUATE translates a string. In practice, it's also necessary to translate a single lexeme, or even a single semantic token (like a number, xt, nt).

Bernd writes:

We need to phase out STATE and define possible replacements before we can have a STATEless API.

Recognizers already don't depend on STATE. Only some token translators depend on STATE. But we cannot avoid them in a Forth system, and cannot eliminate STATE.

The existence of interpretation semantics and compilation semantics of Forth words is associated with two modes (states) of the Forth text interpreter: interpretation state and compilation state. The only way to essentially eliminate STATE is to eliminate one of these modes and the corresponding semantics. For example, one could remove interpretation state and interpretation semantics of words. This is possible, but the resulting language will not be backwards compatible with Standard Forth, since any parsing word must be an "immediate" word in this language.

For example, without interpretation state it's impossible to translate the following program:

: my'  ['] ' execute ;
my' my' constant mytick-xt

Changing the search order outside of definitions is also problematic:

also myvoc myword ( x ) previous  constant my-x

In this line, myword must be recognized in the modified search order. This is only possible in interpretation state, which means that the next lexeme is recognized only after the previous lexeme has been recognized and executed.

Factor is an example of a Forth-like language without interpretation state. There, ordinary words are always "compiled" (added to AST), parsing words (and syntax words) are always immediately executed. See: Factor / Syntax / Parser algorithm.


[r1430] 2025-02-17 01:31:51 ruv replies:

proposal - minimalistic core API for recognizers

Anton writes

Add: STATE-TRANSLATING ( i*x translator -- j*x )

Why is this better than making translator a subtype of xt, and using EXECUTE instead of STATE-TRANSLATING?

The benefits of making translator a subtype of xt:

  • no need for a separate word (for word counters);

  • a translator can be defined as a quotation or an anonymous definition (sometimes this is very convenient);

  • a new translator can be simply defined using other translators;

    • an example to illustrate the idea:
      : translate-2lit ( 2*x -- 2*x | )
        >r translate-lit r> translate-lit
      ;
      
      In some of my implementations, for example, postpone correctly applies to a lexeme that is recognized into a qualified semantic token with this translator.
  • the Forth text interpreter loop can be re-used for other purposes;

    • just for illustration, reuse the Forth text interpreter to count lexemes in a string:
      : count-lexemes ( sd.string -- u )
       0 rot rot  ['] example.evaluate [: 2drop 1+ ['] noop  ;] apply-perceptor
      ;
      s" a b c d" count-lexemes . \ prints "4"
      
      See the apply-perceptor word definition in recognizer-api-ext.fth

[r1431] 2025-02-17 06:33:43 ruv replies:

proposal - minimalistic core API for recognizers

Anton writes in [r1426], 2025-02-15, in the "Discussion" sub-section:

The benefit of factoring out state-translating is that the state dependence can be confined to the place(s) that actually need state dependence: The standard Forth text interpreter (and user-defined text interpreters that are intended to work similarly). It does not infect all translators.

This seems irrelevant to the question of whether translator is a subtype of xt or not. I don't see any benefit of using state-translating against execute for the API users.

Please note that this is irrelevant to the question of whether translator is a subtype of xt or not.

For example, the translator is 1+:

: count-lexemes ( sd.string -- u )
 0 rot rot  ['] example.evaluate [: 2drop ['] 1+ ;] apply-perceptor
;

I provided above an example of a translator that is an xt, and the only difference to the API users is whether state-translate or execute is used. And the latter allows more useful use cases.


[r1432] 2025-02-17 06:45:34 ruv replies:

proposal - minimalistic core API for recognizers

The above message is a draft that was sent accidentally. A better edition is below -)

Anton writes in [r1426], 2025-02-15, in the "Discussion" sub-section:

The benefit of factoring out state-translating is that the state dependence can be confined to the place(s) that actually need state dependence: The standard Forth text interpreter (and user-defined text interpreters that are intended to work similarly). It does not infect all translators.

This seems irrelevant to the question of whether translator is a subtype of xt or not. I don't see any benefit of using state-translating against execute for the API users.

For example, in the case of execute a translator can be even as simple as 1+:

: count-lexemes ( sd.string -- u )
 0 rot rot  ['] evaluate [: 2drop ['] 1+ ;] apply-perceptor
;

This translator is not infected either by state or by a dummy triple ( xt-int xt-comp xt-post ) .


[r1433] 2025-02-17 15:29:33 BerndPaysan replies:

proposal - minimalistic core API for recognizers

I use recognizers for non-Forth languages. These languages are usually state-free, i.e. they are interpret- or compile-only. Using a quotation for the translator is completely sufficient. E.g. the recognizer in net2o's chat message that matches URLs has

[: rework-% $, msg-url ;]

as translator. No need to define a triple-entry translator table. And the translators are indeed all that short, and there's no reusability (a token translates 1:1 to a command plus a way to add the corresponding data). This thing used to be a bit more complex when it was still based on the Trute recognizers, because then, I always needed a table, and used only one slot of it (I ended up with the generic name-translator, and just put the xt I wanted to execute on the stack underneath, so it worked in interpretation state, but was actually compiling the message into a buffer). The text messages are parsed by standard EVALUATE, but with a language-specific recognizer stack that does not have a single Forth recognizer in it.

Therefore I disagree with Anton that the current translator concept ties STATE to every translator: it's the other way round. It ties them only to full-blown Forth translators that work in a mixed interpreter/compiler language, where there is a state (and there, it is inevitable, and you can move that dispatch only around). You can define translators used by Forth with TRANSLATE:, but you can define translators used by other (single-state) languages just as ordinary xt, and with a single action for translation. There's no need for the table and dispatch if your language has no state at all, it's just EXECUTE of the one single action.

When you want to reuse slots of system translators in Gforth (e.g. for a Color-Forth clone), you can use action-of interpreting/compiling/postponing ( translator -- xt ) to access the subfields. That's because all these accessing words are just identical to the defer field for value-style structures.

E.g. Anton's example could be

: cf-recognizer ( <[_]>addr u -- data translator | 0 )
  sp@ fp@ {: sp' fp' :} over c@ >r 1 /string recognizer1
  dup 0= IF   rdrop  EXIT  THEN
  case r> '[' of  action-of interpreting  endof
          '_' of  action-of compiling     endof
          ']' of  action-of postponing    endof
          fp' fp! sp' sp! 2drop 0 dup
  endcase ;

and that works (the vocabularies used by the colorForth core wouldn't have any STATE in it). That way, you don't need to write your own outer interpreting colorForth, the standard Forth interpreter does it.

My design assumption was that making all new data types (recognizer sequences, translators) subtypes of xt, and therefore executable, will pay off, and it did.


[r1434] 2025-02-19 21:50:22 AntonErtl replies:

proposal - minimalistic core API for recognizers

Multiline strings

Checking on Python3, I see that it uses C's syntax for strings starting with ". In particular, if you just do a newline in the middle of a string without escaping the newline, you get an error:

>>> print("abc
  File "<stdin>", line 1
    print("abc
          ^
SyntaxError: unterminated string literal (detected at line 1)

An escaped newline is ignored, and you need to write \n to get an actual newline. I expect that it's the same for most other languages you mention, because they all use a different syntax for "proper multiline strings". I have no problem with an additional recognizer for "proper multiline strings" with a distinguishable syntax (such as """); I can even live with rec-string doing the additional syntax, but I think that there might be others who will disparage it as a WIBNI or somesuch.

But I think that, for "-delimited strings, rec-string should either not do multi-line strings at all or do it the C/Python3/etc. way.
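
For comparison, a minimal sketch of the single-line, escape-based style, using the existing Forth-2012 word S\" rather than rec-string from the proposal (two-lines is a name made up for this illustration):

```forth
\ S\" (Forth-2012) processes C-like escapes; the literal itself stays on
\ one source line, and \n yields an implementation-dependent newline.
: two-lines ( -- ) s\" first\nsecond" type ;
two-lines
```

Executing two-lines types "first" and "second" on separate lines, although the string literal never spans more than one source line.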

STATE-TRANSLATING

Why is this better than making translator a subtype of xt, and using EXECUTE instead of STATE-TRANSLATING?

It is better because it isolates the state-dependence in the word(s) calling state-translating rather than having it in the translator coming out of the recognizer and potentially being invoked through any execute, compile,, is or defer! in the system (with data-flow analysis necessary to reduce the number of potential invocations, and the result of that analysis probably still showing more occurrences than what searching for state-translating would otherwise give us).

It's similar to the difference between arming a bomb at the factory, or arming it only just before dropping it (which may never happen).

Examples of translators not produced with translate:

The proposal states about translate:

This word is the only standard way to define a general purpose translator.

Any argument based on defining translators in other ways is therefore not in line with the proposal.

This applies to [r1432] as well as [r1433].

So the usages you show may work on some particular implementation, but may fail on a different implementation of the proposal.

And if you are willing to design an implementation for some convenient code of your interpretation-only recognizers, I am sure that you are able to design an implementation of recognizers with state-translating that's just as convenient.


[r1435] 2025-02-20 16:40:03 ruv replies:

proposal - minimalistic core API for recognizers

Making translator a subtype of xt

Why is this better than making translator a subtype of xt

It is better because it isolates the state-dependence in the word(s) calling state-translating rather than having it in the translator coming out of the recognizer and potentially being invoked through any execute, compile,, is or defer! in the system (with data-flow analysis necessary to reduce the number of potential invocations, and the result of that analysis probably still showing more occurrences than what searching for state-translating would otherwise give us).

1. It does not isolate the state-dependence in the word(s) calling state-translating — because interpreting and compiling will also exhibit state-dependent behavior on some arguments. At the same time, state-translating will exhibit state-independent behavior on some arguments.

2. If the user prefers the word state-translating because it allows him to find invocations of translators in his code, he can define this word as synonym state-translating execute and use it in his code.

3. Given the choice between the ability to find invocations of translators and the set of benefits that an xt subtype provides, I would prefer the latter.

Any argument based on defining translators in other ways is therefore not in line with the proposal.

Yes, but they are aimed at changing the proposal ))

And if you are willing to design an implementation for some convenient code of your interpretation-only recognizers, I am sure that you are able to design an implementation of recognizers with state-translating that's just as convenient.

Bernd wrote: "This thing used to be a bit more complex when it was still based on the Trute recognizers, because then, I always needed a table, and used only one slot of it".

Side effects

Anton wrote:

It seems to me that the discussion about side effects should go into the non-normative rationale.

Agreed. A standard word cannot have an unspecified side effect that can be detected by a standard program. Therefore, it's sufficient to specify the allowed effects for standard recognizers and for the perceptor in the standard Forth system.

NOTFOUND

Anton wrote:

I have no preference here, but I remember that Matthias Trute presented a case for notfound, and that sounded convincing. Why do his arguments no longer hold (or did they not hold in the first place)?

In Matthias Trute's proposal I don't see any arguments why NOTFOUND is better than zero.

See my arguments why zero is better in the section "Special data object on failure considered harmful" of my comment [r1351] 2024-10-08.

FORTH-RECOGNIZE, deferred or getter and setter?

Anton wrote:

I see no benefits to having a getter and setter here. Deferred words are fine.

Anton, you wrote in comp.lang.forth on 2024-10-05: "I wish they had defined GET-BASE and SET-BASE instead of BASE".

You seem to see shortcomings of BASE. The shortcomings of the deferred word FORTH-RECOGNIZE are similar: if additional actions are needed when setting or getting the value, this is difficult to implement in a system and almost impossible in a program. And this word cannot be redefined by a program.


[r1436] 2025-02-20 21:39:26 BerndPaysan replies:

proposal - minimalistic core API for recognizers

Multiline Strings

Anton, you seem to have missed the largest group that does it identically: Rust, Visual Basic (≥14), R, Ruby, and PHP. A number of languages that started with C-like strings did not continue to follow that example and then obviously needed a different syntax to stay backwards compatible (VB didn't have C-like strings to begin with and the multiline extension was compatible). If we introduce multiline strings as a new feature, we should not first copy a bad example and then add another syntax to fix that.

The primary reason why C's multiline string style is so weird is the preprocessor: the preprocessor has a rudimentary understanding of the language, and it uses \ at the end of the line to concatenate multiple lines. That makes it create single-line entries out of strings, and then it understands that all this is just one string it shouldn't look inside (C macros aren't replaced in strings).

There's absolutely no need to copy C's weird strings caused by their weird preprocessor approach into Forth.


[r1437] 2025-02-22 09:09:43 AntonErtl replies:

proposal - minimalistic core API for recognizers

Checking Ruby (the only one of this bunch of languages that I have installed), I see that it indeed has strings that include newlines. So yes, there are programming languages that allow unescaped newlines in their most popular string syntax, instead of introducing a separate syntax for multi-line strings.

Why is this a mistake? The common case is that a string ends on the same line where it started. If the string terminator is missing on that line, it is often a mistake, and a friendly programming language has a syntax for single-line strings that allows catching the mistake right on that line. By contrast, in Ruby I get:

[~:155788] ruby
puts 'hello,
puts 'world'
-:2: syntax error, unexpected local variable or method, expecting end-of-input
puts 'world'

So it gives me a misleading error message on a different line from where the mistake happened, possibly several lines later. That's why many languages require either escaping the newline or a different syntax for multi-line strings.

The reason for that has nothing to do with backwards compatibility: These languages report an error if there is an unescaped newline in a string with the most popular syntax. Defining that case to do what you want rather than as an error does not break any existing, working programs.

The reason also has nothing to do with the C preprocessor. The C preprocessor has to know when something is inside or outside a string (it must not do macro expansion inside string literals), so it could just as well accept a newline inside a string.

I don't have an opinion on whether we should use an alternative delimiting syntax for multi-line strings, escape the newlines in the syntax for single-line strings, or have both options.


[r1438] 2025-02-22 10:04:11 AntonErtl replies:

proposal - minimalistic core API for recognizers

Making translator a subtype of xt

  1. Yes, having the translators not being state-dependent does not prevent people from performing state-dependent code (and it should not). But what it gives me is that if I avoid such code (and I do), I do not have to worry about state-dependence in every execute, compile,, is and defer!, only in state-translating.

  2. Defining state-translate as an alias of execute does not help, because the state-dependence is in the words defined with translate:. Every other execute, compile,, is or defer! might still do something state-dependent because of that even if I have no other source of state-dependence in my program.

  3. My preference is for translators without state-dependence. As for Bernd Paysan simplifying code when rewriting it, sure, that's his way. That's why I expect that, if he puts his mind to it, he will design an implementation of recognizers without state-dependent translate: children that's just as convenient.

NOTFOUND

After complaints about the proposal being too long, Matthias Trute removed lots of the rationale in one version of his proposal, including the rationale for not using 0. Of course the same person who had earlier complained about the length then complained about incomprehensibility. Anyway, you can find earlier versions of the proposal through Forth200x; there is also a link to the split-out comments there.

FORTH-RECOGNIZE, deferred or getter and setter?

For base, for the optimization I have in mind, one would have to check on every use of # whether base has changed in the meantime, and that cost would be substantial compared to the benefit of the optimization. And it's not just set-base that would avoid the problem: If base was a uvalue (not a uvarue), it would be relatively easy to eliminate the check in Gforth.

For forth-recognize, I have no such optimization in mind. If I had, it would be relatively easy to implement in Gforth without a change check for forth-recognize, because forth-recognize is a deferred word, not a variable.

But even on a system where you cannot attach the optimization to the defer! method of forth-recognize, inserting a change check would be much less of a problem than for base: forth-recognize tends to be much more expensive than #, so any optimization with noticeable benefit will also reduce the cycles per invocation much more, easily amortizing the change check.

Anyway, given that nobody has proposed some actual benefit from having a getter and setter, we should follow Chuck Moore's advice here: Do not speculate. In this case, this means not introducing a getter and setter.


[r1439] 2025-02-22 21:41:29 ruv replies:

proposal - minimalistic core API for recognizers

Making translator a subtype of xt

Defining state-translate as an alias of execute does not help, because the state-dependence is in the words defined with translate:. Every other execute, compile,, is or defer! might still do something state-dependent because of that even if I have no other source of state-dependence in my program.

Does this mean that, according to your idea, a Forth system is not allowed to define state-translate as an alias of execute?

Otherwise, if a Forth system is allowed to provide such an implementation, then defining state-translate as an alias of execute in your program is not distinguishable from such a system's implementation. Other parts of your program simply should not know whether state-translate is an alias of execute or not, and so should not depend on that fact.

In general, other parts can do something state-dependent regardless of whether state-translate is an alias of execute. One can write:

defer foo
: bar ... state-translate ... ;
' bar is foo \ `foo` is state-dependent now (in the general case)
: baz ... foo ... ; \ `baz` is state-dependent (in the general case)

On the other hand, if you do not have other sources of state-dependence in your program (including evaluate and include-file), and you only perform translators using state-translate, how can execute do anything state-dependent in your program other than calling something that calls state-translate?

NOTFOUND

Of course the same person who had earlier complained about the length then complained about incomprehensibility.

Just in case, it wasn't me who complained about the length ;-)

Anyway, you can find earlier versions of the proposal through Forth200x; there is also a link to the split-out comments there.

Thank you, there is a RECTYPE-NULL necessity section in the split-out comments.

In this section the author argues that RECTYPE-NULL (against 0) simplifies the implementation. But the author only considers cases when the result of recognizing is used for translation. He does not consider cases when the result of recognizing is used to obtain a semantic token itself (a number, xt, nt, etc). Thus his argument did not hold in the first place, because there is no point in simplifying a small part of a program at the expense of complicating a larger part. As I have shown, using 0 simplifies programs as a whole, and it is more consistent.

FORTH-RECOGNIZE, deferred or getter and setter?

Anton, I see that you consider only optimization and only in Gforth. I consider programs that extend standard Forth systems in general.

For example, if I want to implement append-perceptor ( xt-recognizer -- ) and prepend-perceptor ( xt-recognizer -- ), I may have to redefine the setter set-perceptor ( xt-recognizer -- ) and getter perceptor ( -- xt-recognizer ). This is impossible without a getter and setter.
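
A minimal sketch of this point, using hypothetical stand-in definitions: the names perceptor and set-perceptor follow the text above, while perceptor-xt, perceptor-changes and the bodies are assumptions made only for illustration (a real system would supply its own getter and setter):

```forth
\ Assumed stand-ins for a system-provided getter and setter.
variable perceptor-xt
: perceptor     ( -- xt-recognizer ) perceptor-xt @ ;
: set-perceptor ( xt-recognizer -- ) perceptor-xt ! ;

\ A program redefines the setter to attach an extra action. Inside the
\ new definition the name still refers to the previous SET-PERCEPTOR.
variable perceptor-changes  0 perceptor-changes !
: set-perceptor ( xt-recognizer -- )
  1 perceptor-changes +!  set-perceptor ;

' dup set-perceptor        \ any xt serves for the demonstration
perceptor-changes @ .      \ prints 1
```

With a deferred word, by contrast, a program that stores a new xt via is or defer! bypasses any such wrapper, which is the limitation described above.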


[r1440] 2025-02-23 16:28:37 BerndPaysan replies:

proposal - minimalistic core API for recognizers

The C preprocessor is by design line oriented, and can't see beyond a single line. This is unlike most modern programming languages, which aren't line oriented anymore (Fortran and COBOL e.g. are line-oriented languages, and need line continuation characters, either & at the end in FORTRAN, or '-' in column 7 in COBOL in the next line). Forth is in many respects not a line-oriented language, but it has some line-oriented limitations (e.g. with PARSE).

What we should talk about is to escape line breaks if they shouldn't go into the output and are only in the string to facilitate editability. Then you can copy-paste a C multiline string, and it also works.

And when you forget the closing quote of a string in Forth, you get weird errors, even within the same line. The way to figure out what goes wrong is by using a syntax highlighting editor that knows about strings (and whether they go multi-line).

: .error-line ( line# error# -- )
  ." error  .  ." in line"  . ; 
*the terminal*:2:23: error: Undefined word
  ." error  .  ." in >>>line"<<<  . ;

Yes, I forgot the closing quote after “error”.


[r1441] 2025-02-24 00:39:15 BerndPaysan replies:

proposal - minimalistic core API for recognizers

Making translator a subtype of xt

As for Bernd Paysan simplifying code when rewriting it, sure, that's his way. That's why I expect that, if he puts his mind to it, he will design an implementation of recognizers without state-dependent translate: children that's just as convenient.

We had that before. It was less convenient.

The whole point is that the translator is or isn't state-dependent, depending on the language you are creating (if it is Forth, it is). The result of moving the state dispatch around showed that this is the position where you actually can get rid of it when your language doesn't have states. You actually don't get rid of the state-infested translator if you say “this is a table, and in order to handle what's in there in the interpreter, you need state-translating”. It's still infested with the concept of states. By putting state-translating into the interpreter, which is a reusable component (you can just replace the entire recognizer stack and read in different languages with normal words like included or evaluate), you force this concept upon all translators, whether their language has that concept or not.

We now have these direct access words (interpreting, compiling, and postponing), and their use is very limited. Two of the three serve as text for the prompt. postponing is used in postpone. And there's the possibility in Gforth, to extend these tables to further states for other languages, which reuse existing recognizers, by patching their operation into the additional field. The newly created operator is used to populate the tables, and to set the state, and that's it.

If you want to get completely rid of state in the long run, put it into the Forth-specific translators. If your modified Forth-like language doesn't need state anymore, your translators won't need it, either. And then, it's just gone.


[r1442] 2025-02-26 16:36:23 ruv replies:

proposal - minimalistic core API for recognizers

Named translators

Anton wrote on 2025-02-15:

The benefit of having each translator word return a translator token is that one does not need to tick the translator words in all the recognizers. A slight improvement in writability and readability with no downside

This has several disadvantages:

  • when you use a translator to translate a semantic token, you have to do it via execute (or compile a call using compile, directly);
    • e.g.: xt-translator execute;
  • when a new translator is defined using other translators, you have to call them via execute (or compile,);
  • if you define new translators as colon definitions (which is very convenient), these translators do translation on execution, and if standard named translators return xt on execution — this will lead to inconsistency.

On the other hand, the need of ticking the translator words in recognizers is mitigated when we use a tick recognizer. Thus, instead of ['] translate-xt we can write 'translate-xt (or with back-tick in Gforth's parlance).

Gerund

Anton wrote on 2025-02-16:

It's not clear to me why the gerund form is used (INTERPRETING etc.),

I think, they are temporary quick and dirty names.

According to the naming convention of standard words, the names of these words must begin with an English verb, or be just an English verb, because they perform some actions with side effects, i.e. change some states (the reverse is not true).


[r1443] 2025-03-06 01:10:47 BerndPaysan replies:

proposal - minimalistic core API for recognizers

Gerund

One reason for using this is that the usage of these words has been shown to be extremely limited (other than postponing, which is used once in postpone), and one of the remaining use cases was to print the current state as readable text in the prompt by just doing get-state id. (id. is getting from the xt to the nt, and then does name>string type).

Grammar-wise, it also looks more natural to use the gerund here.