Digest #120 2020-09-08
- 2020-09-07 The document is published at forth-standard.org
The different proposals about recognizers use different terminologies that conflict with each other and with the language of the Standard. There are many examples of this.
Also, wrong terminology produces word names with confusing etymology. In the end, it discredits the Forth language in the wider community of programmers.
Let's use a common terminology that is correct, accurately defined, and compatible with the language of the Standard. I suggest the following one. The latest version is available on GitHub. Improvements are welcome.
tuple: a logical union of several elements that keeps their order; when a tuple is placed into the data stack, the rightmost element in writing is the topmost on the stack, and floating-point numbers are placed into the floating-point stack.
lexeme: a syntactic unit of a program (a source code); unless otherwise noted, it is a sequence of non-blank characters delimited by a blank.
to recognize a lexeme: to determine the interpretation semantics and the compilation semantics for the lexeme in the current dynamic context.
to interpret a lexeme: to perform the interpretation semantics for the lexeme in the current dynamic context.
to compile a lexeme: to perform the compilation semantics for the lexeme in the current dynamic context.
to translate a lexeme: to interpret the lexeme if interpreting, or to compile the lexeme if compiling.
dynamic context of a lexeme: information that is available at the time the lexeme is translated.
unqualified token: a tuple of arbitrary data objects that determines the interpretation semantics and the compilation semantics for a lexeme in its dynamic context.
token: unqualified token (a synonym, when it is clear from context).
to interpret a token: to perform the interpretation semantics that are determined by the token.
to compile a token: to perform the compilation semantics that are determined by the token.
to translate a token: to interpret the token if interpreting, or to compile the token if compiling.
token translator: a Forth definition that translates a token; also, depending on context, an execution token for this Forth definition.
resolver: a Forth definition that recognizes a lexeme, producing a tuple of a token and its token translator.
token descriptor object: an implementation-dependent data object (a set of information) that describes how to interpret and how to compile a token.
token descriptor: a value that identifies a token descriptor object; also, less formally and depending on context, a Forth definition that just returns this value, or a token descriptor object itself.
fully qualified token: a tuple of a token and its token descriptor.
recognizer: a Forth definition that recognizes a lexeme, producing a fully qualified token.
simple recognizer: a recognizer that can produce only one token descriptor.
compound recognizer: a recognizer that can produce different token descriptors.
perceptor: a recognizer that is currently used by the Forth text interpreter to translate a lexeme.
default perceptor: the perceptor before it was changed by a program.
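To make the terms "token", "token translator" and "resolver" more tangible, here is a minimal sketch in standard Forth. The names tt-lit and resolve-unum, the stack effect of the resolver, and the convention of returning 0 on failure are assumptions made for illustration only; they are not taken from any of the proposals.

```forth
decimal

\ Hypothetical token translator for a single-cell number token:
\ leaves x on the stack if interpreting, compiles it as a literal if compiling.
: tt-lit ( x -- x | )  state @ if postpone literal then ;

\ Hypothetical resolver for unsigned single-cell numbers in the current BASE.
\ Returns the token and its translator, or 0 if the lexeme is not recognized.
: resolve-unum ( c-addr u -- x xt-tt | 0 )
  dup 0= if 2drop 0 exit then        \ an empty lexeme is not a number
  0 0 2swap >number                  ( ud c-addr2 u2 )
  nip if 2drop 0 exit then           \ unconverted characters remain
  drop ['] tt-lit ;                  \ keep the low cell as the token
```

For example, while interpreting, `s" 123" resolve-unum execute` leaves 123 on the stack, and `s" 12x" resolve-unum` leaves 0.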
The need for some terms is obvious from the comparison of some proposals.
The less obvious term is perceptor. Why not just "current recognizer"? One argument is that "current" leads to longer word names. In general, we introduce new nouns to make things shorter. E.g., how should we name the word that sets (or selects) the recognizer that will be used by the Forth text interpreter? A name built on "current recognizer", or set-perceptor? The latter is shorter. The former also has a wrong connection to the set-current word (its name is suboptimal too).
A token brings all required information to interpret it and to compile it, if you know what kind this token is.
An "execution token" is a token, just as a "token of a single-cell number" is a token.
- Having an "execution token" on the stack, you know how to interpret it, and how to compile it. So, it brings all required semantics for translation. So, it's a token.
- Having a "name token" on the stack, you know how to interpret it, and how to compile it. So, it brings all required semantics for translation. So, it's a token.
- Having a "token of a single-cell number" on the stack, you know how to interpret it, and how to compile it (since it represents the number itself). So, it brings all required semantics for translation. So, it's a token.
- Having a "token of a double-cell number" on the stack (it takes two cells), you know how to interpret it, and how to compile it (since it represents the number itself). So, it brings all required semantics for translation. So, it's a token.
- Having a "string literal token" on the stack (it takes two cells), you know how to interpret it, and how to compile it. So, it brings all required semantics for translation. So, it's a token.
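The cases in the list above can be sketched as token translators in standard Forth. The names tt-xt and tt-slit are hypothetical, the state is read from STATE, and tt-xt assumes the execution token of an ordinary word (one with default interpretation semantics); tt-slit relies on SLITERAL from the String word set.

```forth
\ Hypothetical translator for an execution token of an ordinary word:
\ execute it if interpreting, compile it if compiling.
: tt-xt ( i*x xt -- j*x )  state @ if compile, else execute then ;

\ Hypothetical translator for a string literal token (two cells).
: tt-slit ( c-addr u -- c-addr u | )  state @ if postpone sliteral then ;

\ Helper for the demonstration: translates the xt of DUP while compiling.
: dup, ( -- )  ['] dup tt-xt ; immediate

\ DUP is compiled into DOUBLE through the translator.
: double ( n -- 2n )  dup, + ;
```

While interpreting, `2 ' dup tt-xt` simply executes DUP and leaves 2 2; inside the definition of double the same translator compiles DUP instead.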
If we want to allow a recognizer to have other effects beyond determining the semantics for a lexeme, then the definition should be changed accordingly.
The definition of "to translate" can be extended to other modes.
We can replace "token" with another appropriate English noun. But not with "rectype", which isn't an English word and has an inappropriate etymology.
We can replace "descriptor" with "type", but the former sounds better to me. The words "type" and "class" have a more abstract connotation than "descriptor". In any case, the corresponding object describes something. So "descriptor" looks like a good choice.
We can replace "perceptor" with another appropriate English noun, preferably a single word (see the comparison of some variants).
: rectype-lit: ( xt -- ) ['] noop swap dup >r :noname r@ compile, r> postpone literal postpone compile, postpone ; rectype: ;
not so straightforward, but possible.
But not with "rectype", that isn't an English word, and has inappropriate etymology.
Indeed, "rectype" is an alternative not to "token" but to "token descriptor" (or just "descriptor", when it is obvious from the context).
In general, I like the approach of active "rectype", i.e. when you can execute it to translate a token — so a "rectype" is a token translator:
( i*x token -- j*x ).
I described this approach in comp.lang.forth in 2018 (news:email@example.com).
and then define generic rectypes just like in Matthias Trute's version with rectype:
I also showed, just for illustration, a hybrid variant, where a "rectype" can be executed and can also be an argument of the accessors (it is also compatible with version D, i.e. it is a "passive rectype", as JennyBrien mentioned above).
But the accessors from version D exclude some implementation approaches. Actually, these accessors are useless when the higher-level methods are provided. Getting an xt and then executing this xt is an extra step without any benefit in most cases. Let's provide the corresponding methods instead of the accessors.
This works with this method, but not with the previous way.
I'm not sure what you refer to, but "automatic postpone for literals" can be implemented in version D too.
: create-rectype-for-literal ( xt-compiler "name" -- ) ['] noop swap dup rectype: ;
Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves
RECTYPE-SOMETYPE ( i*x state -- j*x )
By convention, the name of such a word should start with an English verb.
Concerning passing the state: in my Resolvers API, the state is passed indirectly, i.e. not via the stack. This makes it easier to combine translators.
: tt-3lit ( 3*x -- 3*x | ) >r tt-2lit r> tt-lit ;
: tt-3lit-s ( 3*x state -- 3*x | ) dup >r swap >r tt-2lit-s r> r> tt-lit-s ;
Passing the state is cumbersome. Also, take into account that it's usually already kept in a variable anyway. Why do you need to pass it via the stack again and again? What is the rationale for passing it directly?
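For comparison, here is a minimal runnable sketch of the indirect variant. It assumes that tt-lit and tt-2lit read the state from STATE; these definitions, and the helper words 123, and three, are hypothetical and serve only to show how the combined translator behaves.

```forth
\ Hypothetical base translators; the state is taken from STATE.
: tt-lit  ( x -- x | )      state @ if postpone literal then ;
: tt-2lit ( 2*x -- 2*x | )  >r tt-lit r> tt-lit ;

\ The combined translator is built from the existing ones.
: tt-3lit ( 3*x -- 3*x | )  >r tt-2lit r> tt-lit ;

\ Helper that runs the translator while compiling.
: 123, ( -- )  1 2 3 tt-3lit ; immediate

\ The three numbers are compiled into THREE as literals.
: three ( -- 1 2 3 )  123, ;
```

While interpreting, `1 2 3 tt-3lit` leaves 1 2 3 on the stack unchanged; while compiling, the same word compiles three literals, so `three` pushes 1 2 3 at run time. No state argument has to be threaded through the intermediate words.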
Please stop using confusing terminology such as "data type id" (in "The core principle is still that the recognizer is not aware of state, and the returned data type id is"). This terminology is not compatible with the language of the Standard. I suggested proper terminology before and have now published the proposal on forth-standard.org; let's use it (and improve it where needed), or let's accurately define another terminology. The fact is that all the proposals about recognizers can share the same terminology.
Another example is the term "recognizer types". If a recognizer is a Forth definition having particular behavior, then a "recognizer type" is a "type of a recognizer", that is, a type of a Forth definition, something like a function type. But what you actually mean is a "token descriptor", that is, a "descriptor of a token", which tells something about the corresponding token and tells nothing about recognizers (as Forth definitions).
A huge advantage of this approach (but only when the state is passed indirectly) is that most user-defined token translators can be created far more easily than the corresponding descriptors ("rectypes"). You don't need to cope with three actions, and you don't need to cope with the state at all, since any token translator can be created from other, already defined translators!
One advantage that the rectype-* names have over tokenclass-* is that the association with recognizers is more obvious.
But it should have association with tokens, not with recognizers!
What is the etymology of "rectype"?
I see the following disadvantages of "rectype":
- It's not an English word; it's an abbreviation, and it isn't explained.
- "rec" is first associated with "record", which has nothing to do with recognizers.
- "rectype" is first associated with a type of record, which is wrong. The second association is a type of a recognizer, which is also wrong.
- "rectype" describes a token, not a recognizer, but it refers to a recognizer.
E.g., why is the entity that describes an "execution token" called "rectype of execution token"? (My suggestion is: "descriptor of execution token").
not worth the costs of trying to find it, finding consensus on it, changing the existing code and documentation,
When we make a mistake, we pay for it. "rectype" is a mistake in the choice of a name. It seems that most of us who work on recognizer proposals understand this, but did not want to look for a better name at an earlier stage and postponed the choice to a later stage. And now you say it is not worth changing this name.
So we should have made this correction earlier. And now, I believe, we should pay the price, and the price is commensurate with the mistake.
Actually, the cost of changing the code is a weak argument. The internal code is not required to be updated (it's enough to make synonyms), and it also can be updated via auto-replacing.
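A minimal sketch of the synonym approach, assuming a system that provides SYNONYM (Forth 2012, Programming-Tools extension word set). Both names and the body of the stand-in word are hypothetical; in a real system the old word already exists and only the last line is needed.

```forth
\ Stand-in for an existing word under its old name (hypothetical body).
: rectype-num ( -- x )  42 ;

\ The new name is introduced as a synonym; code using the old name is untouched.
synonym descriptor-num rectype-num
```

After this, rectype-num and descriptor-num behave identically, so internal code can keep the old name while the documented API uses the new one.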
The cost of changing the documentation is an even weaker argument. It can be updated via auto-replacing. But its terminology is wrong in any case, and that terminology has to be fixed manually in far more places.
Concerning "trying to find it, finding consensus on it": we didn't even try. We have only two alternative suggestions (perhaps only one by now). And nobody has said that "rectype" by itself is better than "token descriptor" for our purpose.
Yes, I proposed that kind of solution years ago. In effect, both ways have the same expressive power, but one does it by creation of noname words, the other by normal code. Acceptance may differ.
Combination with tick was discussed in news:firstname.lastname@example.org.
'(a::b.c) (without parentheses), where a::b.c should be resolvable into an ordinary Forth word (i.e., that has default interpretation semantics).
'(a::b).c or something else?
('a)::(b.c) does not make sense since 'a is resolved into a single-cell number.
'(a::b).c does not make sense for the same reason.
Also, it should be a consequence of the specification for the tick prefix (the corresponding recognizer or resolver), i.e. that it resolves the remaining part into an ordinary Forth word.
This proposal isn't supposed to do that.
The Recognizer API (actually, all the versions and rewordings), the minimalistic core API for recognizers, and my Resolver API all technically support the mentioned nesting: they support implementing independent recognizers for 'X and for X::Y such that 'X::Y will work automatically.
Compare: [...] with:
: rectype: create , , , ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
rectype: rectype-nt
(sic: the full postpone action).
This comparison is incorrect, since in the proposed API rectype: (which generates a token translator) can be defined as follows:
: rectype: ( xt-executer xt-compiler xt-postponer "name" -- )
  >r >r >r :
  ]]  0 of [[ r> xt, ]] endof
     -1 of [[ r> xt, ]] endof
     -2 of [[ r> xt, ]] endof
     -22 throw endcase [[
  postpone ;
;
And you can use your same code to define your rectype-nt or anything else.