Digest #292 2025-03-22
Contributions
Maybe it was already discussed, but to me the first test should be: T{ 0. D2* -> 0. }T or maybe T{ 0. D2* -> 0 0 }T because the test is the same on both sides.
Replies
The standard generally leaves it to the system where it puts the compiled code. It might be in the dictionary, but it could also be elsewhere. Or it could be in the dictionary and elsewhere (e.g., Gforth puts threaded code in the dictionary, and native code elsewhere). The standard gives very few guarantees about this, so it also talks very little about it: There is "code space" in "2.1 Definition of Terms", and there is "3.3.2 Code Space" which does not give any guarantees.
Your best bet at reclaiming code space is to use FORGET or MARKER, but there is no guarantee that these words actually reclaim code space. And given the complexity of implementation arising from that, I am thinking about changing Gforth such that it does not reclaim the native code when MARKER is used.
I think this answers the question, so I am closing it. If there is anything unclear yet, write a reply and reopen it.
Request for clarification: `NAME>STRING` result is transient
I think the idea was to allow systems that store definition names in a representation other than that returned by `NAME>STRING`. One example would be the fig-Forth representation of names, which sets the high bit of the last byte (fig-Forth only supports names in ASCII).
Now that we have had `NAME>STRING` in the standard for a decade, we can look at the systems that actually implement this word. If they all return a name that lives as long as the definition, we could enhance this word by giving that guarantee. But who will examine the systems and make the proposal?
how it is implemented, especially where the definition list of the word created using `:NONAME` is compiled.
The standard intentionally does not specify many options, which fall into implementation-defined options (which shall be documented) and implementation-dependent options (which might be undocumented).
Simple implementations typically reserve a big memory region for the dictionary and use data space for code space too. Then, in direct and indirect threaded code, the words `compile,` and `lit,` are defined simply as:
: compile, ( xt -- ) , ;
: lit, ( x -- ) ['] lit compile, , ;
And
:noname 2 * ;
is equivalent to
:noname [ 2 lit, ' * compile, ] ;
Is there any way to free the memory?
This is possible using `marker`:
marker restore-dict
: foo 123 . ;
foo \ prints "123"
restore-dict
foo \ error: not found
restore-dict \ error: not found
In some Forth systems you can create multiple dictionaries and free them independently of each other.
I have checked. There are about 22 Forth systems on GitHub that provide `name>string` implemented in Forth (see the search results).
Among these systems there is only one, namely solo-forth (description: "Standard Forth system for ZX Spectrum 128 and compatible computers, with disk drives"), in which the word `name>string` returns a string in a transient buffer. This system has to copy the resulting string into the transient buffer because the header space and the data space are located in different address spaces.
So, the only advantage of a transient result is that it allows saving memory in some cases. And it only makes sense when saving 10-100 KiB of memory (say, 8% of the compiled code size) matters.
An example of an approach that can benefit is the use of a trie data structure (prefix tree) to implement efficient searching of word lists. This data structure does not need to store entire strings. Therefore, to save memory, a transient string for a word name can be constructed each time it is needed.
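For illustration only, a hedged sketch (not solo-forth's actual code; the accessors `name>header` and `far-c@` are hypothetical): when names live in a separate header space, `name>string` has to copy the name into a transient buffer, so the result stays valid only until the next call.
32 constant /name-buf
create name-buf /name-buf chars allot
: name>string ( nt -- c-addr u )
  name>header ( h-addr u )   \ hypothetical: locate the name in header space
  /name-buf min dup >r
  0 ?do  dup i + far-c@  name-buf i chars + c!  loop  \ copy into the transient buffer
  drop  name-buf r> ;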
Re `interpreting` (similar arguments apply to the word `compiling` too)
From the proposal's "Problem" section:
The Forth interpreter is stateful, but the API should avoid the problems of the `STATE` variable. In particular, an implementation without `STATE` should be possible, and there is only one place where the stateful dispatch is necessary.
We should consider that Forth words may do stateful dispatch by themselves and they may rely on the value of `STATE`. Usually, the Forth system itself cannot determine whether a user-defined word performs stateful dispatch. Therefore, it is essential for the Forth system to ensure the `STATE` variable is correctly set to reflect the formal state of the Forth text interpreter when executing a user-defined word.
The assumption that the value of `STATE` is irrelevant when xt-int is executed (because xt-int does not perform stateful dispatch itself) is flawed. This is because, when xt-int is executed, it may invoke a user-defined word that performs stateful dispatch.
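For illustration (a hedged example, not from the proposal): a user-defined word that performs stateful dispatch at run time. If xt-int invokes such a word while `STATE` still indicates compilation state, the wrong branch is taken.
: .msg ( c-addr u -- )
  \ interpretation: type the string; compilation: compile it as a literal plus TYPE
  state @ if  postpone sliteral  postpone type  else  type  then ;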
The suggested word `INTERPRETING ( j*x xt -- k*x )` is confusing and useless, because it just executes xt-int (obtained from xt) and does not ensure that the value of `STATE` is 0 before a user-defined word is invoked by xt-int.
I suggested the word `execute-interpreting` that applies to any xt. When it is applied to a token translator, the corresponding interpretation semantics are performed. And this works correctly even when a user-defined word that performs stateful dispatch is invoked by the token translator.
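For reference, a minimal sketch of such a word (an illustration only, not ruv's actual definition; it switches to interpretation state with `[` and back with `]`, and does not handle the cases where xt throws or deliberately leaves `STATE` changed):
: execute-interpreting ( i*x xt -- j*x )
  state @ 0= if execute exit then  \ already in interpretation state: just execute
  postpone [ execute ] ;           \ enter interpretation state, execute xt, return to compilation state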
From the rationale to `TRANSLATE:`:
The by far most common usage of translators is inside the outer interpreter, and this default mode of operation is called by `EXECUTE` to keep the API small. You can not simply set `STATE`, use `EXECUTE` and afterwards restore `STATE` to perform interpretation or compilation semantics, because words can change `STATE`, so you need the words `INTERPRETING` and `COMPILING` defined below.
The provided specification does not guarantee that `interpreting` and `compiling` solve this problem, and in the reference implementation they do not solve the problem.
The words `execute-interpreting` and `execute-compiling` solve the problem, and they do not need to know xt-int or xt-comp from a token translator.
Re "postponing" state
What is the rationale for formally introducing a "postponing" state? If you need it only for `]] ... [[`, then it's better to extract them into a separate proposal, and also provide grounds why this approach is better than implementing `]]` as a parsing word.
Why do you need to specify that `]]` changes `STATE` to a third value if user-defined words do not see this value and cannot analyze this value?
Re `set-state` and `get-state`
It's unclear how these words can be used.
It seems that the word `SET-STATE` is underspecified. Also, its name is confusing, because it is formally not allowed to change `STATE` (at the moment).
Re translators and translate:
translator: named subtype of xt, and executes with the following stack effect: name ( j*x i*x -- k*x )
Why do you require a translator to be named? I use anonymous token translators (defined as quotations) and find them very useful.
From the spec to `TRANSLATE:`:
Create a translator word under the name "name". This word is the only standard way to define a general purpose translator.
It's necessary to define what a "general purpose translator" is, and it should be clear how a translator that is not a general purpose translator can be defined in a standard way.
Minimize core
To reduce the scope of discussion, we should minimize the core API.
So, it is better to put the words `]]`, `[[`, `RECOGNIZER-SEQUENCE`, etc., into separate proposals.
But a recognizer for local variables should be added, because they are already standardized and a Forth system that supports local variables must recognize them.
@ruv: I agree that it would be nicer to have sequences etc. not in the proposal. However: they are a good example of how recognizers can be used in practice, which makes me think they should stay inside to give some guidance to newbies. We can still ask Bernd for further modifications, but I think we should do so grouped together after the meeting, to avoid unnecessary edits.
I don't see a problem with separating the proposal into several smaller ones, especially taking out optional parts that belong together.
The postpone mode can indeed be implemented either loop-style (i.e. like PolyForth's `]`), or with a state; it shouldn't be necessary to specify the details.
If you have `STATE`-smart words in your system or user-defined such words, the only way to get the correct interpretation and compilation semantics involves having `STATE` as expected; you can't just call `INTERPRETING` or `COMPILING` on a translator or use some table index mechanism as in the Trute proposal to call the right slot.
If you don't have such things and have a Forth system where `STATE`-free replacement mechanisms are used for dual-semantics words (e.g. Gforth or VFX), and you don't define `STATE`-smart words yourself, you can actually use that API. That's why I think such an API can actually be standardized before we make `STATE` obsolescent and have standardized replacements available.
can actually be standardized
I mean can't. We need to phase out `STATE` and define possible replacements before we can have a `STATE`-less API.
At the online meeting on 2025-02-13 I was asked to present a subproposal for factoring the state-dependent component out of `TRANSLATE:`.
There are many possible ways to skin this cat, e.g., the one in Matthias Trute's proposal, or the way that present proposal used up to v4 and earlier. Here I present a way that requires relatively few changes to the current version of this proposal.
XY.3.1 Definition of terms
Replace the definition of translator with:
translator: a cell-sized opaque token that represents how a recognized lexeme can be interpreted, compiled, or postponed. A translator usually needs additional data about the recognized lexeme that is deeper in the stacks.
Replace uses of translator-xt in `?NOTFOUND` with translator, and likewise for other words that, in [r1412], consume or push the xt of a translator.
XY.6 Glossary
TRANSLATOR:
Replace the definition of `TRANSLATE:` with
TRANSLATOR: ( xt-int xt-comp xt-post "<spaces>name" -- )
Skip leading space delimiters. Parse name delimited by a space. Create a definition for name with the execution semantics defined below.
name is referred to as translator.
name Execution: ( -- translator )
translator represents a translator with interpretation action xt-int, compilation action xt-comp, and postpone action xt-post.
Modified words:
INTERPRETING
( i*x translator -- k*x )
Execute xt-int of translator.
COMPILING
( j*x translator -- l*x )
Execute xt-comp of translator.
POSTPONING
( j*x translator -- )
Execute xt-post of translator.
STATE-TRANSLATING
Add:
STATE-TRANSLATING
( i*x translator -- j*x )
Remove translator from the stack.
If the system has a postpone state, and is currently in postpone state, execute xt-post of translator.
Otherwise, if the system is in interpretation state, execute xt-int of translator.
Otherwise, execute xt-comp of translator.
Discussion
The benefit of having each translator word return a translator token is that one does not need to tick the translator words in all the recognizers. A slight improvement in writability and readability with no downside (compared to [r1412]).
The benefit of factoring out `state-translating` is that the `state` dependence can be confined to the place(s) that actually need `state` dependence: the standard Forth text interpreter (and user-defined text interpreters that are intended to work similarly). It does not infect all translators.
Typical use
The standard interpreter loop:
: interpret ( i*x -- j*x )
BEGIN parse-name dup WHILE forth-recognize ?found state-translating REPEAT
2drop ;
The implementation of `POSTPONE` is the same as in the existing proposal:
: postpone ( "name" -- )
parse-name forth-recognize ?found postponing ; immediate
The implementation of `'` becomes slightly shorter (no need to tick `translate-nt`):
: ' ( "name" -- xt )
parse-name forth-recognize ?found
translate-nt <> #-32 and throw
name>interpret ;
Now for interpreter loops that do not use `STATE`. First, the polyForth division of interpreter and compiler:
: parse-name-refill ( -- c-addr u )
begin
parse-name dup 0= while
2drop refill 0= if
0 0 exit then
repeat ;
: ] ( i*x -- j*x )
  BEGIN
    parse-name-refill dup WHILE
    2dup "[" str= 0= WHILE
      forth-recognize ?found compiling
  REPEAT THEN
  2drop ;
: pf-interpret ( i*x -- j*x )
BEGIN parse-name-refill dup WHILE forth-recognize ?found interpreting REPEAT
2drop ;
And here's one for colorforth-bw:
: cfbw-interpret ( i*x -- j*x )
begin
parse-name dup while
over c@ >r 1 /string forth-recognize ?found r> case
'[' of interpreting endof
'_' of compiling endof
']' of postponing endof
-13 throw
endcase
  repeat 2drop ;
The problem with these interpreters is that there is no standardized or proposed way to plug this `interpret` into the existing infrastructure (e.g., `included`), so the benefit of being able to write this is limited to one line (in the case of colorforth-bw) or the rest of the file in the case of the polyForth-style interpreter.
But the recognizer proposal allows replacing `forth-recognize`, and this allows us to plug colorforth-bw into the text interpreter until further notice. I presented a way to do it with an earlier version of this proposal in [r1397]; here's a way of doing it with [r1412] modified by this sub-proposal:
defer recognizer1 action-of forth-recognize is recognizer1
: translator-bw1 ( i*x translator c -- j*x )
case
'[' of interpreting endof
'_' of compiling endof
']' of postponing endof
-13 throw
endcase ;
' translator-bw1 dup dup translator: translator-bw
: recognize-colorforth-bw ( c-addr u -- translator )
dup 0= if 2drop 0 exit then
over c@ >r 1 /string recognizer1
r> over if translator-bw else drop then ;
' recognize-colorforth-bw is forth-recognize
Reference implementation:
A straightforward implementation is:
: translator: ( xt-int xt-comp xt-post "<spaces>name" -- )
create , , , ;
: state-translating ( i*x translator -- j*x )
state @ if compiling else interpreting then ;
This does not cover a potential postpone state; if a system has a postpone state and can enter the standard text interpreter in this state, then the implementation of `state-translating` should be extended accordingly.
Of course, this implementation of `state-translating` is far too inefficient for some tastes, so here's a more clever one:
: state-translating ( i*x translator -- j*x )
2 state @ 0<> + cells + @ execute ;
For even more efficiency we can redefine `]` and `[`:
defer state-translating
: [ ( -- )
postpone [ ( old implementation ) ['] interpreting is state-translating ; immediate
[ \ initialize state-translating
: ] ( -- )
] ( old implementation ) ['] compiling is state-translating ;
If there is a word that sets the postpone state, that word should also set `state-translating` accordingly.
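A hedged sketch in the same style as the redefinitions above, assuming that the word entering postpone state is `]]` and that it is immediate:
: ]] ( -- )
  postpone ]] ( old implementation ) ['] postponing is state-translating ; immediate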
There are also changes involving words that push literal translator tokens. In [r1412] the translator word needs to be ticked; in this subproposal you do not do that. E.g., `rec-nt` now looks as follows:
: rec-nt ( addr u -- nt nt-translator | 0 )
forth-wordlist find-name-in dup IF translate-nt THEN ;
STATE-dependence
[r1412] still contains a defining word for state-dependent translators (and none for translators without this mistake), which are unacceptable to me. I have suggested an improvement in [r1426].
Dividing the proposal?
There have been some discussions about dividing the proposal. I don't think that that's a good idea for the discussion, but in usage I see the division into the following hierarchy of use cases, which require different words; the later use cases usually require also implementing the words for the earlier use cases:
1. Programs that use the default recognizers. For them we need to specify a standard recognizer sequence (including how to deal with locals): `REC-NT` `REC-NUM` `REC-FLOAT` (if present) corresponds to Forth-2012. I expect systems that have `REC-STRING` and `REC-TICK` to put these into their recognizer sequence, too. How do we document in the program documentation which recognizers are needed? Probably we need to extend the program documentation requirements (until now the recognition of doubles, floats and locals has been coupled with documenting the double, float and local wordsets, respectively, but for `REC-STRING` and `REC-TICK` that's probably not the way to go). The new `POSTPONE` is also at that usage level.
2. Programs that change which of the existing recognizers are used and in what order. For them we need the names of the existing recognizers (not sure about the translators), `FORTH-RECOGNIZE`, `SET-RECOGNIZER-SEQUENCE`, `GET-RECOGNIZER-SEQUENCE`, `.RECOGNIZERS` (not yet proposed) and maybe `RECOGNIZER-SEQUENCE:`. If all the standardized recognizers are in `FORTH-RECOGNIZE` by default, there will probably not be much of this kind of usage, except maybe to put `REC-FLOAT` in front of `REC-NUM` (to recognize "1." as a float; `REC-FLOAT` would have to be defined in more detail for that to work).
3. Programs that define new recognizers that use existing translators. This usage needs the names of the translators (see the sketch after this list).
4. Programs that define new translators. This usage needs `TRANSLATE:` (or `TRANSLATOR:`).
5. Programs that define text interpreters and programming tools that have to deal with recognizers (such as a recognizer-aware `postpone`). These programs need `INTERPRETING`, `COMPILING`, `POSTPONING` or `STATE-TRANSLATING`.
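As a hedged sketch of usage level 3 (referenced in item 3 above): a recognizer for `0x`-prefixed hexadecimal numbers that reuses an existing number translator. The translator name `translate-num` is an assumption here; in [r1412] its xt would have to be ticked instead.
: rec-0x ( c-addr u -- n translator | 0 )
  dup 3 < if 2drop 0 exit then
  over 2 s" 0x" compare if 2drop 0 exit then
  2 /string                                 \ skip the 0x prefix
  0. 2swap base @ >r hex >number r> base !
  nip if 2drop 0 exit then                  \ unconverted characters left: fail
  drop translate-num ;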
A system with recognizers is a program of all these types, so all these words will be present in every such system (with the exception of some recognizers and related translators), so there is little point in making most of these words optional (except `rec-float`, `rec-string`, `rec-tick` and translators used only by those recognizers). But it is still a good idea to present the words divided by these usages. We usually present words in alphabetical order in the document. Should we continue this tradition for these words? If so, the division of words above should probably be documented in the rationale.
For word counters
Given that usage 5 above is rare in user programs, word counters may prefer to replace the four words `INTERPRETING`, `COMPILING`, `POSTPONING` and `STATE-TRANSLATING` with one word
TRANSLATING ( i*x translator n -- j*x )
where
- `0 TRANSLATING` is equivalent to `INTERPRETING`
- `-1 TRANSLATING` is equivalent to `COMPILING`
- `-2 TRANSLATING` is equivalent to `POSTPONING`
- `STATE @ 0<> TRANSLATING` is equivalent to the reference implementation of `STATE-TRANSLATING`
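For illustration, a minimal sketch (not part of the proposal) of `TRANSLATING` on top of the reference implementation of `TRANSLATOR:` given earlier, which stores xt-post, xt-comp and xt-int in consecutive cells:
: translating ( i*x translator n -- j*x )
  2 + cells + @ execute ;  \ n=0: xt-int, n=-1: xt-comp, n=-2: xt-post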
A simple Forth system has only one use of `POSTPONING` (in `POSTPONE`) and one use of `STATE-TRANSLATING` (in `INTERPRET`), so defining 4 words for the purpose may seem excessive. And replacing them with `TRANSLATING` saves a tiny bit of source code and memory.
OTOH, there is no standard way to use `TRANSLATING` for `STATE-TRANSLATING` in the general case, where the system has a postpone state, because there is no standard way to determine postpone state. Moreover, the specification of `TRANSLATING` is not so nice (that's why I left it out in the above), and the code using it will be less readable.
Gerund
It's not clear to me why the gerund form is used (`INTERPRETING` etc.), although I kept with it for my suggestions (for consistency). I would use an imperative form; and because "interpret", "compile" and "postpone" are already taken, maybe something like `TRANSLATOR>INTERPRET` or somesuch, which would parallel `NAME>INTERPRET`. However, the latter pushes an xt, the former executes it, so either we let `TRANSLATOR>INTERPRET` also produce an xt, or use a slightly different naming scheme, such as `TRANSLATOR*INTERPRET`.
GET-STATE SET-STATE
It's unclear what `get-state` and `set-state` do, and their names suggest a stack effect ( -- f ) and ( f -- ).
The reference implementation does not make that any clearer; in particular, the reference implementation of `set-state` does not make any sense at all, and I would not know why anybody would want to use `get-state`.
[IF] parts
This makes the proposal hard to understand and discuss. Take a decision (possibly after asking around, but I doubt that anyone but you and maybe ruv has a proper basis for an opinion), put it in the proposal, and give a rationale for the decision in a Discussion section.
Side effects
I do not see a good way to specify in the normative part of the document that a recognizer must not have a side effect. The proposal mentions "supposed to" and "promise". The normative part says what specific words do (or there is an ambiguous condition). It seems to me that the discussion about side effects should go into the non-normative rationale. It's clear enough what happens when somebody uses a word that invokes a recognizer, and that recognizer has a side effect; no need for an ambiguous condition.
NOTFOUND
I have no preference here, but I remember that Matthias Trute presented a case for notfound, and that sounded convincing. Why do his arguments no longer hold (or did they not hold in the first place)?
FORTH-RECOGNIZE, deferred or getter and setter?
I see no benefits to having a getter and setter here. Deferred words are fine.
Presentation
The "Solution" chapter is not comprehensible except to those deep into the discussion: It is full of unexplained terms, such as "data parsing", "token type". And "translator" is not comprehensible to anybody who comes fresh to the proposal, and even to those who have seen some earlier recognizer proposals.
The second part of "Solution" should be a separate section "Transition for some implementors/users of Matthias Trute's proposal".
More NOTFOUND stuff
The proposal defines `?FOUND`, `?NOTFOUND`, and `NOTFOUND` only for NOTFOUND=0. This looks like a bug to me.
The stack effect of `?FOUND` and other words: we do not have "never" in the standard. What's that supposed to mean?
XY.3.1 Translator
"named subtype"? What's that? The rest of the wording is woefully inadequate. A careful specification would reveal the complexity that you get with state-dependent translators.
?NOTFOUND
`?NOTFOUND` has a horrible stack effect. This word is not shown in any typical use examples. Is it needed? If it is needed, maybe the stack effects of the other words can be changed to make it unnecessary; although, admittedly, when I worked on combining recognizers, I did not find a solution with a nice stack flow (and I have tried). Hmm, maybe with a variant of `case` with a specialized variant of `of`?
POSTPONE
"if the exception wordset is not present". The exception wordset has been a required part of Forth200x for several years.
SET-RECOGNIZER-SEQUENCE
As specified, the sequence will always fit. Can the sequence fail to fit? If so, specify what happens.
REC-NUM
Should this be the all-singing, all-dancing variant (including doubles, number prefixes and '<char>')? Given existing practice and the legacy code base, yes. OTOH, with recognizers it seems a conceptually attractive option to have `rec-num` be a decomposable sequence consisting of the various cases. But given nestable recognizer sequences, that's always an option for the future.
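Just to illustrate the decomposable option, a hedged one-line sketch; the stack effect assumed here for `RECOGNIZER-SEQUENCE:` ( xt-rec1 ... xt-recn n "<spaces>name" -- ) and the component recognizers `rec-char`, `rec-double` and `rec-single` are assumptions, not taken from the proposal.
' rec-char ' rec-double ' rec-single 3 recognizer-sequence: rec-num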
SCAN-TRANSLATE-STRING
This should follow C conventions for newlines like the rest of the string syntax, i.e., escape newlines with \. If other conventions are desired (e.g. what may or may not be JSON syntax), that would be for another recognizer and another translator.
The specification should be clear about what it does: "`REFILL` can be used to read in more lines" is neither here nor there.
TRANSLATE-STRING ?SCAN-STRING
What are these words good for? `REC-STRING` apparently does not need them.
[[
A word with neither interpretation nor compilation semantics?
Should we specify whether there is a postpone state, or alternatively that `]]` has its own text interpreter loop? There are ways to distinguish these two kinds of implementation; does it matter? Maybe if you want to `EVALUATE` something in postpone state or somesuch.
`]]` and `[[` should probably go into a separate proposal.
STATE
Changing the specification of `state` such that there is at least one non-zero value that does not mean "compilation state" is not an extension of the current specification of `state`, but a change. However, existing practice of systems which use -2 as postpone state suggests that this does not break existing code in practice. That's probably because so little existing code actually uses postpone state. With wider use of postpone state, some breakage may actually turn up.
The safe option would be to represent postpone state (if we have it at all) in a way other than through a value of `STATE`. E.g., have another variable `POSTPONE-STATE`: if it's false, then `STATE` determines the state; if it's true, the system is in postpone state.
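As an illustration only, a minimal sketch of this option, combined with the earlier reference implementation of `state-translating`:
variable postpone-state   0 postpone-state !
: state-translating ( i*x translator -- j*x )
  postpone-state @ if postponing exit then   \ postpone state takes priority
  state @ if compiling else interpreting then ;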
In any case, if we put `]]` in another proposal, that's where we should have this discussion.
Multiline strings
I don't think C is setting a good example. Nobody took C's syntax for proper multiline strings, not even C++. C is still an important legacy language, but COBOL also is in the top 20. You don't want to have multiline strings like COBOL.
- C++11 got raw strings, and gcc supports them even in C. The syntax has `R"(` as start, and `)"` as end (with the option of adding more letters to disambiguate the string ending). Raw strings don't translate backslash+character sequences, which is often what you want, because the multiline string is actually some other programming language, and the editor is fine inserting all the characters you want there without escapes. Note that you need some way to disambiguate the string ending in a raw string, as you can't escape `"`.
- Rust, Visual Basic (≥14), R, Ruby, and PHP strings are multiline by default (inserting newlines where the string has line breaks)
- JavaScript (using template literals) and Go use `` ` `` (backtick) for multiline (raw) strings
- C# uses `@"` to start a multiline string
- SQL uses `'` (single quote) for multiline strings
- Java 15 has text blocks (with `"""` as start and end)
- Python uses either `"""` or `'''` for multiline strings
Nobody makes proper multiline strings like C. Really nobody. Not even recent C compilers; they follow C++. I'm now at item 20 of the Tiobe index, and most languages nowadays have multiline strings one way or the other. Getting Emacs to recognize multiline strings was easy: just remove the `\n` from the end-of-string pattern. Emacs likes multiline strings. JSON variants with multiline strings are likely from developers that use Ruby or PHP. You have to deal with this sort of stuff.
The most popular option seems to be multiline strings by default, when legacy (e.g. through a C-like syntax) isn't a problem. As we are adding a new syntax for string literals, we don't need to care about backwards compatibility. One popular feature is to remove blanks from auto-indented strings, as editors indent these strings. Strictly speaking, if we support non-raw multi-line strings, we could even parse C strings, if a \ as last character is defined as “don't add a newline here” (instead of “unfinished escape sequence”).
Bernd writes:
If you have `STATE`-smart words in your system or user-defined such words, the only way to get the correct interpretation and compilation semantics involves having `STATE` as expected; you can't just call `INTERPRETING` or `COMPILING` on a translator or use some table index mechanism as in the Trute proposal to call the right slot.
Right.
If you don't have such things and have a Forth system where STATE-free replacement mechanisms are used for dual-semantics words (e.g. Gforth or VFX), and you don't define STATE-smart words yourself, you can actually use that API.
In Forth, you almost always have such things, because you have `EVALUATE` and `INCLUDE-FILE`, which depend on `STATE`.
`INCLUDE-FILE` translates a file, `EVALUATE` translates a string. In practice, it's also necessary to translate a single lexeme, or even a single semantic token (like a number, xt, nt).
Bernd writes:
We need to phase out `STATE` and define possible replacements before we can have a `STATE`-less API.
Recognizers already don't depend on `STATE`. Only some token translators depend on `STATE`. But we cannot avoid them in a Forth system, and cannot eliminate `STATE`.
The existence of interpretation semantics and compilation semantics of Forth words is associated with two modes (states) of the Forth text interpreter: interpretation state and compilation state. The only way to essentially eliminate `STATE` is to eliminate one of these modes and the corresponding semantics. For example, one could remove interpretation state and interpretation semantics of words. This is possible, but the resulting language will not be backwards compatible with Standard Forth, since any parsing word must be an "immediate" word in this language.
For example, without interpretation state it's impossible to translate the following program:
: my' ['] ' execute ;
my' my' constant mytick-xt
Changing the search order outside of definitions is also problematic:
also myvoc myword ( x ) previous constant my-x
In this line, `myword` must be recognized in the modified search order. This is only possible in interpretation state, which means that the next lexeme is recognized only after the previous lexeme has been recognized and executed.
Factor is an example of a Forth-like language without interpretation state. There, ordinary words are always "compiled" (added to the AST), and parsing words (and syntax words) are always immediately executed. See: Factor / Syntax / Parser algorithm.
Anton writes
Add:
STATE-TRANSLATING ( i*x translator -- j*x )
Why is this better than making translator a subtype of xt, and using `EXECUTE` instead of `STATE-TRANSLATING`?
The benefits of making translator a subtype of xt:
- no need for a separate word (for word counters);
- a translator can be defined as a quotation or anonymous definition (sometimes this is very convenient);
- a new translator can be simply defined using other translators; an example from some of my implementations to illustrate the idea:
: translate-2lit ( 2*x -- 2*x | ) >r translate-lit r> translate-lit ;
`postpone` correctly applies to a lexeme that is recognized into a qualified semantic token with this translator;
- the Forth text interpreter loop can be re-used for other purposes; just for illustration, reuse the Forth text interpreter to count lexemes in a string:
: count-lexemes ( sd.string -- u )
  0 rot rot ['] evaluate [: 2drop 1+ ['] noop ;] apply-perceptor ;
s" a b c d" count-lexemes . \ prints "4"
See the `apply-perceptor` word definition in recognizer-api-ext.fth.
Anton writes in [r1426], 2025-02-15, in the "Discussion" sub-section:
The benefit of factoring out `state-translating` is that the `state` dependence can be confined to the place(s) that actually need state dependence: the standard Forth text interpreter (and user-defined text interpreters that are intended to work similarly). It does not infect all translators.
This seems irrelevant to the question of whether translator is a subtype of xt or not. I don't see any benefit of using `state-translating` against `execute` for the API users.
Please note that this is irrelevant to the question of whether translator is a subtype of xt or not.
For example, a translator can be as simple as `1+`:
: count-lexemes ( sd.string -- u )
  0 rot rot ['] evaluate [: 2drop ['] 1+ ;] apply-perceptor ;
I provided above an example of a translator that is an xt, and the only difference to the API users is whether `state-translate` or `execute` is used. And the latter provides more useful use cases.
The above message is a draft that was sent accidentally. A better edition is below -)
Anton writes in [r1426], 2025-02-15, in the "Discussion" sub-section:
The benefit of factoring out `state-translating` is that the `state` dependence can be confined to the place(s) that actually need state dependence: the standard Forth text interpreter (and user-defined text interpreters that are intended to work similarly). It does not infect all translators.
This seems irrelevant to the question of whether translator is a subtype of xt or not. I don't see any benefit of using `state-translating` against `execute` for the API users.
For example, in the case of `execute` a translator can be even as simple as `1+`:
: count-lexemes ( sd.string -- u )
0 rot rot ['] evaluate [: 2drop ['] 1+ ;] apply-perceptor
;
This translator is not infected either by `state` or by a dummy triple ( xt-int xt-comp xt-post ).
I use recognizers for non-Forth languages. These languages are usually state-free, i.e. they are interpret-only or compile-only. Using a quotation for the translator is completely sufficient. E.g. the recognizer in net2o's chat message that matches URLs has
[: rework-% $, msg-url ;]
as translator. No need to define a triple-entry translator table. And the translators are indeed all that short, and there's no reusability (a token translates 1:1 to a command plus a way to add the corresponding data). This thing used to be a bit more complex when it was still based on the Trute recognizers, because then I always needed a table, and used only one slot of it (I ended up with the generic name-translator, and just put the xt I wanted to execute on the stack underneath, so it worked in interpretation state, but was actually compiling messages into a buffer). The text messages are parsed by standard `EVALUATE`, but with a language-specific recognizer stack that has no single Forth recognizer in it.
Therefore I disagree with Anton that the current translator concept ties `STATE` to every translator: it's the other way round. It ties `STATE` only to full-blown Forth translators that work in a mixed interpreter/compiler language, where there is a state (and there, it is inevitable, and you can only move that dispatch around). You can define translators used by Forth with `TRANSLATE:`, but you can define translators used by other (single-state) languages just as ordinary xts, with a single action for translation. There's no need for the table and dispatch if your language has no state at all; it's just `EXECUTE` of the one single action.
When you want to reuse slots of system translators in Gforth (e.g. for a Color-Forth clone), you can use `action-of interpreting`/`compiling`/`postponing` ( translator -- xt ) to access the subfields. That's because all these accessing words are just identical to the defer field for value-style structures.
E.g. Anton's example could be
: cf-recognizer ( <[_]>addr u -- data translator | 0 )
sp@ fp@ {: sp' fp' :} over c@ >r 1 /string recognizer1
dup 0= IF rdrop EXIT THEN
case r> '[' of action-of interpreting endof
'_' of action-of compiling endof
']' of action-of postponing endof
fp' fp! sp' sp! 2drop 0 dup
endcase ;
and that works (the vocabularies used by the colorForth core wouldn't have any `STATE` in them). That way, you don't need to write your own outer interpreter for colorForth; the standard Forth interpreter does it.
My design assumption was that making all new data types (recognizer sequences, translators) subtypes of xt, and therefore executable, will pay off, and it did.
Multiline strings
Checking on Python3, I see that it uses C's syntax for strings starting with `"`. In particular, if you just put a newline in the middle of a string without escaping the newline, you get an error:
>>> print("abc
File "\<stdin\>", line 1
print("abc
^
SyntaxError: unterminated string literal (detected at line 1)
An escaped newline is ignored, and you need to write `\n` to get an actual newline. I expect that it's the same for most other languages you mention, because they all use a different syntax for "proper multiline strings". I have no problem with an additional recognizer for "proper multiline strings" with a distinguishable syntax (such as `"""`); I can even live with `rec-string` doing the additional syntax, but I think that there might be others who will disparage it as a WIBNI or somesuch.
But I think that, for `"`-delimited strings, `rec-string` should either not do multi-line strings at all or do it the C/Python3/etc. way.
STATE-TRANSLATING
Why is this better than making translator a subtype of xt, and using EXECUTE instead of STATE-TRANSLATING?
It is better because it isolates the state-dependence in the word(s) calling `state-translating` rather than having it in the translator coming out of the recognizer and potentially being invoked through any `execute`, `compile,`, `is` or `defer!` in the system (with data-flow analysis necessary to reduce the number of potential invocations, and the result of that analysis probably still showing more occurrences than what searching for `state-translating` would otherwise give us).
It's similar to the difference between arming a bomb at the factory, or arming it only just before dropping it (which may never happen).
Examples of translators not produced with translate:
The proposal states about translate:
This word is the only standard way to define a general purpose translator.
Any argument based on defining translators in other ways is therefore not in line with the proposal.
This applies to [r1432] as well as [r1433].
So the usages you show may work on some particular implementation, but may fail on a different implementation of the proposal.
And if you are willing to design an implementation for some convenient code of your interpretation-only recognizers, I am sure that you are able to design an implementation of recognizers with `state-translating` that's just as convenient.
Making translator a subtype of xt
Why is this better than making translator a subtype of xt
It is better because it isolates the state-dependence in the word(s) calling `state-translating` rather than having it in the translator coming out of the recognizer and potentially being invoked through any `execute`, `compile,`, `is` or `defer!` in the system (with data-flow analysis necessary to reduce the number of potential invocations, and the result of that analysis probably still showing more occurrences than what searching for `state-translating` would otherwise give us).
1. It does not isolate the state-dependence in the word(s) calling `state-translating`, because `interpreting` and `compiling` will also exhibit state-dependent behavior on some arguments. At the same time, `state-translating` will exhibit state-independent behavior on some arguments.
2. If the user prefers the word `state-translating` because it allows him to find invocations of translators in his code, he can define this word as a synonym (`synonym state-translating execute`) and use it in his code.
3. Given the choice between the ability to find invocations of translators and the set of benefits that an xt subtype provides, I would prefer the latter.
Any argument based on defining translators in other ways is therefore not in line with the proposal.
Yes, but they are aimed at changing the proposal ))
And if you are willing to design an implementation for some convenient code of your interpretation-only recognizers, I am sure that your are able to design an implementation of recognizers with state-translating that's just as convenient.
Bernd wrote: "This thing used to be a bit more complex when it was still based on the Trute recognizers, because then, I always needed a table, and used only one slot of it".
Side effects
Anton wrote:
It seems to me that the discussion about side effects should go into the non-normative rationale.
Agreed. A standard word cannot have an unspecified side effect that can be detected by a standard program. Therefore, it's sufficient to specify the allowed effects for standard recognizers and for the perceptor in the standard Forth system.
NOTFOUND
Anton wrote:
I have no preference here, but I remember that Matthias Trute presented a case for notfound, and that sounded convincing. Why do his arguments no longer hold (or did they not hold in the first place)?
In Matthias Trute's proposal I don't see any arguments why NOTFOUND is better than zero.
See my arguments why zero is better in the section "Special data object on failure considered harmful" of my comment [r1351] 2024-10-08.
FORTH-RECOGNIZE, deferred or getter and setter?
Anton wrote:
I see no benefits to having a getter and setter here. Deferred words are fine.
Anton, you wrote in comp.lang.forth on 2024-10-05: "I wish they had defined `GET-BASE` and `SET-BASE` instead of `BASE`".
You seem to see shortcomings of `BASE`. The shortcomings of the deferred word `FORTH-RECOGNIZE` are similar: if additional actions are needed when setting or getting the value, this is difficult to implement in a system and almost impossible in a program. And this word cannot be redefined by a program.
Multiline Strings
Anton, you seem to have missed the largest group that does it identically: Rust, Visual Basic (≥14), R, Ruby, and PHP. A number of languages that started with C-like strings did not continue to follow that example and then obviously needed a different syntax to stay backwards compatible (VB didn't have C-like strings to begin with, so the multiline extension was compatible). If we introduce multiline strings as a new feature, we should not first copy a bad example and then add another syntax to fix that.
The primary reason why C's multiline string style is so weird is the preprocessor: the preprocessor has a rudimentary understanding of the language, and it uses \ at the end of the line to concatenate multiple lines. That makes it create single-line entries out of strings, and then it understands that all this is just one string it shouldn't look inside (C macros aren't replaced in strings).
There's absolutely no need to copy C's weird strings caused by their weird preprocessor approach into Forth.
Checking Ruby (the only of this bunch of languages that I have installed), I see that it indeed has strings that include newlines. So yes, there are programming languages that allow unescaped newlines in their most popular string syntax, instead of introducing a separate syntax for multi-line strings.
Why is this a mistake? The common case is that a string ends on the same line where it started. If the string terminator is missing on that line, it is often a mistake, and a friendly programming language has a syntax for single-line strings that allows catching the mistake right on that line. By contrast, in Ruby I get:
[~:155788] ruby
puts 'hello,
puts 'world'
-:2: syntax error, unexpected local variable or method, expecting end-of-input
puts 'world'
So it gives me a misleading error message on a different line from where the mistake happened, possibly several lines later. That's why many languages require either escaping the newline or a different syntax for multi-line strings.
The reason for that has nothing to do with backwards compatibility: These languages report an error if there is an unescaped newline in a string with the most popular syntax. Defining that case to do what you want rather than as an error does not break any existing, working programs.
The reason also has nothing to do with the C preprocessor. The C preprocessor has to know when something is inside or outside a string (it must not do macro expansion inside string literals), so it could just as well accept a newline inside a string.
I don't have an opinion on whether we should use an alternative delimiting syntax for multi-line strings, escape the newlines in the syntax for single-line strings, or have both options.
Making translator a subtype of xt
Yes, having the translators not being `state`-dependent does not prevent people from performing `state`-dependent code (and it should not). But what it gives me is that if I avoid such code (and I do), I do not have to worry about state-dependence in every `execute`, `compile,`, `is` and `defer!`, only in `state-translating`.
Defining `state-translate` as an alias of `execute` does not help, because the `state`-dependence is in the words defined with `translate:`. Every other `execute`, `compile,`, `is` or `defer!` might still do something `state`-dependent because of that, even if I have no other source of `state`-dependence in my program.
My preference is for translators without `state`-dependence. As for Bernd Paysan simplifying code when rewriting it, sure, that's his way. That's why I expect that, if he puts his mind to it, he will design an implementation of recognizers without `state`-dependent `translate:` children that's just as convenient.
NOTFOUND
After complaints about the proposal being too long, Matthias Trute removed lots of the rationale in one version of his proposal, including the rationale for not using 0. Of course the same person who had earlier complained about the length then complained about incomprehensibility. Anyway, you can find earlier versions of the proposal through Forth200x; there is also a link to the split-out comments there.
FORTH-RECOGNIZE, deferred or getter and setter?
For `base`, for the optimization I have in mind, one would have to check on every use of `#` whether `base` has changed in the meantime, and that cost would be substantial compared to the benefit of the optimization. And it's not just `set-base` that would avoid the problem: if `base` was a uvalue (not a user variable), it would be relatively easy to eliminate the check in Gforth.
For `forth-recognize`, I have no such optimization in mind. If I had, it would be relatively easy to implement in Gforth without a change check for `forth-recognize`, because `forth-recognize` is a deferred word, not a variable.
But even on a system where you cannot attach the optimization to the `defer!` method of `forth-recognize`, inserting a change check would be much less of a problem than for `base`: `forth-recognize` tends to be much more expensive than `#`, so any optimization with noticeable benefit will also reduce the cycles per invocation much more, easily amortizing the change check.
Anyway, given that nobody has proposed some actual benefit from having a getter and setter, we should follow Chuck Moore's advice here: Do not speculate. In this case, this means not introducing a getter and setter.
Making translator a subtype of xt
Defining `state-translate` as an alias of `execute` does not help, because the state-dependence is in the words defined with `translate:`. Every other `execute`, `compile,`, `is` or `defer!` might still do something state-dependent because of that even if I have no other source of state-dependence in my program.
Does this mean that, according to your idea, a Forth system is not allowed to define `state-translate` as an alias of `execute`?
Otherwise, if a Forth system is allowed to provide such an implementation, then defining `state-translate` as an alias of `execute` in your program is not distinguishable from such a system's implementation. Other parts of your program simply should not know whether `state-translate` is an alias of `execute` or not, and so should not depend on that fact.
In general, other parts can do something state-dependent regardless of whether `state-translate` is an alias of `execute`. One can write:
defer foo
: bar ... state-translate ... ;
' bar is foo \ `foo` is state-dependent now (in the general case)
: baz ... foo ... ; \ `baz` is state-dependent (in the general case)
On the other hand, if you do not have other sources of state-dependence in your program (including `evaluate` and `include-file`), and you only perform translators using `state-translate`, how can `execute` do anything state-dependent in your program other than calling something that calls `state-translate`?
NOTFOUND
Of course the same person who had earlier complained about the length then complained about incomprehensibility.
Just in case, it wasn't me who complained about the length ;-)
Anyway, you can find earlier versions of the proposal through Forth200x; there is also a link to the split-out comments there.
Thank you, there is a RECTYPE-NULL necessity section in the split-out comments.
In this section the author argues that `RECTYPE-NULL` (against `0`) simplifies the implementation. But the author only considers cases where the result of recognizing is used for translation. He does not consider cases where the result of recognizing is used to obtain a semantic token itself (a number, xt, nt, etc). Thus his argument did not hold in the first place, because there is no point in simplifying a small part of a program at the expense of complicating a larger part. As I have shown, using `0` simplifies programs as a whole, and it is more consistent.
FORTH-RECOGNIZE, deferred or getter and setter?
Anton, I see that you consider only optimization and only in Gforth. I consider programs that extend standard Forth systems in general.
For example, if I want to implement `append-perceptor ( xt-recognizer -- )` and `prepend-perceptor ( xt-recognizer -- )`, I may have to redefine the setter `set-perceptor ( xt-recognizer -- )` and the getter `perceptor ( -- xt-recognizer )`. This is impossible without a getter and setter.
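For illustration, a hedged sketch of the kind of extension this enables, assuming ruv's getter/setter API (`perceptor` and `set-perceptor` as above): a program can redefine the setter to perform an additional action, because inside the new definition the previous `set-perceptor` is still the one that is found.
variable #perceptor-changes   0 #perceptor-changes !
: set-perceptor ( xt-recognizer -- )
  1 #perceptor-changes +!   \ additional action: count how often the perceptor is changed
  set-perceptor ;           \ still calls the previous definition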
The C preprocessor is by design line oriented, and can't see beyond a single line. This is unlike most modern programming languages, which aren't line oriented anymore (Fortran and COBOL, e.g., are line-oriented languages and need line continuation characters: either `&` at the end of the line in Fortran, or '-' in column 7 of the next line in COBOL). Forth is in many respects not a line-oriented language, but it has some line-oriented limitations (e.g. with `PARSE`).
What we should talk about is to escape line breaks if they shouldn't go into the output and are only in the string to facilitate editability. Then you can copy-paste a C multiline string, and it also works.
And when you forget the closing quote of a string in Forth, you get weird errors, even within the same line. The way to figure what goes wrong is by using a syntax highlighting editor that knows about strings (and if they go multi-line).
: .error-line ( line# error# -- )
." error . ." in line" . ;
*the terminal*:2:23: error: Undefined word
." error . ." in >>>line"<<< . ;
Yes, I forgot the closing quote after “error”.
Making translator a subtype of xt
As for Bernd Paysan simplifying code when rewriting it, sure, that's his way. That's why I expect that, if he puts his mind to it, he will design an implementation of recognizers without `state`-dependent `translate:` children that's just as convenient.
We had that before. It was less convenient.
The whole point is that the translator is or isn't `state`-dependent, depending on the language you are creating (if it is Forth, it is). The result of moving the state dispatch around showed that this is the position where you actually can get rid of it when your language doesn't have states. You actually don't get rid of the state-infested translator if you say "this is a table, and in order to handle what's in there in the interpreter, you need `state-translating`". It's still infested with the concept of states. By putting `state-translating` into the interpreter, which is a reusable component (you can just replace the entire recognizer stack and read in different languages with normal words like `included` or `evaluate`), you force this concept upon all translators, whether their language has that concept or not.
We now have these direct access words (`interpreting`, `compiling`, and `postponing`), and their use is very limited. Two of the three serve as text for the prompt. `postponing` is used in `postpone`. And there's the possibility in Gforth to extend these tables to further states for other languages, which reuse existing recognizers, by patching their operation into the additional field. The newly created operator is used to populate the tables, and to set the state, and that's it.
If you want to get completely rid of `state` in the long run, put it into the Forth-specific translators. If your modified Forth-like language doesn't need `state` anymore, your translators won't need it, either. And then, it's just gone.
Named translators
Anton wrote on 2025-02-15:
The benefit of having each translator word return a translator token is that one does not need to tick the translator words in all the recognizers. A slight improvement in writability and readability with no downside
This has several disadvantages:
- when you use a translator to translate a semantic token, you have to do it via `execute` (or compile a call using `compile,` directly), e.g.: `xt-translator execute`;
- when a new translator is defined using other translators, you have to call them via `execute` (or `compile,`);
- if you define new translators as colon definitions (which is very convenient), these translators do translation on execution, and if standard named translators return an xt on execution, this will lead to inconsistency.
On the other hand, the need to tick the translator words in recognizers is mitigated when we use a tick recognizer. Thus, instead of `['] translate-xt` we can write `'translate-xt` (or with a back-tick in Gforth's parlance).
Gerund
Anton wrote on 2025-02-16:
It's not clear to me why the gerund form is used (`INTERPRETING` etc.),
I think they are temporary quick-and-dirty names.
According to the naming convention of standard words, the names of these words must begin with an English verb, or be just an English verb, because they perform some actions with side effects, i.e. change some states (the reverse is not true).
Gerund
One reason for using this is that the usage of these words has been shown to be extremely limited (other than `postponing`, which is used once in `postpone`), and one of the remaining use cases was to print the current state as readable text in the prompt by just doing `get-state id.` (`id.` gets from the xt to the nt, and then does `name>string type`).
Grammar-wise, it also looks more natural to use the gerund here.