,---------------.
| Contributions |
`---------------´

,------------------------------------------
| 2020-09-07 13:56:43 ruv wrote:
| proposal - Common terminology for recognizers discurse and specifications
| see: https://forth-standard.org/proposals/common-terminology-for-recognizers-discurse-and-specifications#contribution-161
`------------------------------------------

## Author

Ruv

## Change Log

- 2020-09-07 The document is published at forth-standard.org

## Problem

The various proposals about recognizers use different terminologies that conflict with each other and with the language of the Standard. We have many examples of that.

Also, wrong terminology produces word names that have a confusing etymology.

Ultimately, it discredits the Forth language in the wider community of programmers.

## Solution

Let's use a common terminology that is correct, accurately defined, and compatible with the language of the Standard.

I suggest the following one. The latest version is [available at GitHub](https://github.com/ForthHub/fep-recognizer/blob/master/terms-and-datatypes.md). Improvements are welcome.

## Proposal

**tuple**: a logical union of several elements that keeps their order; when a tuple is placed on the data stack, the rightmost element in writing is the topmost on the stack, and floating-point numbers are placed on the floating-point stack.

**lexeme**: a syntactic unit of a _program_ (a source code); unless otherwise noted, it is a sequence of non-blank characters delimited by a blank.

to **recognize** a _lexeme_: to determine the _interpretation semantics_ and the _compilation semantics_ for the _lexeme_ in the current dynamic context.

to **interpret** a _lexeme_: to perform the _interpretation semantics_ for the _lexeme_ in the current dynamic context.

to **compile** a _lexeme_: to perform the _compilation semantics_ for the _lexeme_ in the current dynamic context.

to **translate** a _lexeme_: to _interpret_ the _lexeme_ if interpreting, or to _compile_ the _lexeme_ if compiling.

**dynamic context** of a _lexeme_: information that is available at the time the _lexeme_ is _translated_.

**unqualified token**: a tuple of arbitrary _data objects_ that determines the _interpretation semantics_ and the _compilation semantics_ for a _lexeme_ in its _dynamic context_.

**token**: _unqualified token_ (a synonym, when it is clear from context).

to **interpret** a _token_: to perform the _interpretation semantics_ that are determined by the token.

to **compile** a _token_: to perform the _compilation semantics_ that are determined by the token.

to **translate** a _token_: to _interpret_ the _token_ if interpreting, or to _compile_ the _token_ if compiling.

**token translator**: a _Forth definition_ that translates a _token_; also, depending on context, an _execution token_ for this Forth definition.

**resolver**: a _Forth definition_ that recognizes a _lexeme_, producing a tuple of a _token_ and its _token translator_.

**token descriptor object**: an _implementation dependent_ _data object_ (a set of information) that describes how to interpret and how to compile a _token_.

**token descriptor**: a value that identifies a _token descriptor object_; also, less formally and depending on context, a Forth definition that just returns this value, or a _token descriptor object_ itself.

**fully qualified token**: a tuple of a _token_ and its _token descriptor_.

**recognizer**: a _Forth definition_ that recognizes a _lexeme_, producing a _fully qualified token_.

**simple recognizer**: a _recognizer_ that can produce only one and the same _token descriptor_.

**compound recognizer**: a _recognizer_ that can produce different _token descriptors_.

**perceptor**: a _recognizer_ that is currently used by the Forth text interpreter to translate a _lexeme_.

**default perceptor**: the _perceptor_ before it was changed by a program.

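To make the terms _token_, _token translator_ and _resolver_ more tangible, here is a minimal sketch in standard Forth. It is only an illustration under stated assumptions: the name `resolve-char`, the `'c'` lexeme syntax, and the convention of returning a `true`/`false` flag are all hypothetical and are not specified by the definitions above; `tt-lit` is borrowed as a name from the related discussion and is defined here in one possible way.

```forth
\ Token translator for a single-cell number token.
\ The state is read indirectly (from STATE), not passed on the stack.
: tt-lit ( x -- x | ) state @ if postpone literal then ;

\ Hypothetical resolver for lexemes of the form 'c' (one character in quotes).
\ On success it returns the token (the character code) and the xt of its
\ token translator; the trailing flag is an assumed convention.
: resolve-char ( c-addr u -- x xt true | false )
  3 <> if drop false exit then                          \ exactly three characters
  dup c@ [char] ' <> if drop false exit then            \ opening quote
  dup 2 chars + c@ [char] ' <> if drop false exit then  \ closing quote
  char+ c@ ['] tt-lit true ;

\ Usage: recognize the lexeme, then translate the resulting token.
: char-demo ( -- x ) s" 'A'" resolve-char if execute else 0 then ;
```

When `char-demo` runs, the system is interpreting, so executing the token translator simply leaves the character code on the stack; in compilation state the same translator would compile it as a literal.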
## Rationale

The need for some terms is obvious from the [comparison of some proposals](https://gist.github.com/ruv/af796cece2ecd2ee541d883a04483dcc#file-11-comparision-to-some-past-versions-md).

### "Perceptor"

A less obvious term is _perceptor_. Why not just "current recognizer"?

One argument is that "current" leads to longer word names. In general, we introduce new nouns to make things shorter.

E.g., how should we name the word that sets (or selects) the recognizer that will be used by the Forth text interpreter?

We have: `set-current-recognizer` vs `set-perceptor`. The latter is shorter. The former also suggests a wrong connection to the [`set-current`](/standard/search/SET-CURRENT) word (whose name is suboptimal too).

### "Token"

A token brings all required **information to interpret it and to compile it**, if you know what kind of token it is.

An "execution token" is a token, just as a "token of a single cell number" is a token.

Examples:

- Having an "execution token" on the stack, you know how to interpret it, and how to compile it. So, it brings all required semantics for translation. So, it's a token.
- Having a "name token" on the stack, you know how to interpret it, and how to compile it. So, it brings all required semantics for translation. So, it's a token.
- Having a "token of a single cell number" on the stack, you know how to interpret it, and how to compile it (since it represents the number itself). So, it brings all required semantics for translation. So, it's a token.
- Having a "token of a double cell number" on the stack (it takes two cells), you know how to interpret it, and how to compile it (since it represents the number itself). So, it brings all required semantics for translation. So, it's a token.
- Having a "string literal token" on the stack (it takes two cells), you know how to interpret it, and how to compile it. So, it brings all required semantics for translation. So, it's a token.

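The examples above can be sketched as one token translator per kind of token. This is a non-authoritative illustration in standard Forth (using `2LITERAL`, `SLITERAL`, `NAME>INTERPRET` and `NAME>COMPILE` from the optional word sets); the `tt-` names and the demo words are assumptions made for this sketch, and `tt-xt` assumes the xt belongs to an ordinary word with default interpretation semantics.

```forth
\ Each translator interprets its token if interpreting, or compiles it if compiling.
: tt-xt   ( i*x xt -- j*x )            state @ if compile, else execute then ;
: tt-lit  ( x -- x | )                 state @ if postpone literal  then ;
: tt-2lit ( x1 x2 -- x1 x2 | )         state @ if postpone 2literal then ;
: tt-slit ( c-addr u -- c-addr u | )   state @ if postpone sliteral then ;
: tt-nt   ( i*x nt -- j*x )
  state @ if name>compile execute exit then
  name>interpret ?dup 0= if -14 throw then  \ -14: interpreting a compile-only word
  execute ;

\ Demo: an immediate word runs its translator in the current state.
: lit42    ( -- 42 | ) 42 tt-lit ; immediate
: answer   ( -- 42 )   lit42 ;            \ compiling: 42 becomes a literal of ANSWER
: call-dup ( i*x -- j*x ) ['] dup tt-xt ; immediate
: twice    ( n -- 2n ) call-dup + ;       \ compiling: DUP is compiled into TWICE
```

Typing `lit42` at the interpreter prompt leaves 42 on the stack, while using it inside a colon definition compiles 42 as a literal, so the caller does not have to deal with the state at all.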
## Discussion

If we want to allow a recognizer to have [other effects](https://github.com/ForthHub/fep-recognizer/issues/7) beyond determining the semantics for a lexeme, then the definition should be changed accordingly.

The definition of "to translate" can be extended to other modes.

We can replace "token" with another appropriate **English noun**. But not with "rectype", which **isn't an English word** and has an inappropriate etymology.

We can replace "descriptor" with "type", but the former sounds better to me. The words "type" and "class" have a more abstract connotation than "descriptor". In any case, the corresponding object describes something. So "descriptor" looks like a good choice.

We can replace "perceptor" with another appropriate **English noun**, preferably a single word (see the comparison of [some variants](https://github.com/ForthHub/fep-recognizer/issues/3)).

,---------.
| Replies |
`---------´

,------------------------------------------
| 2020-09-07 10:20:28 JennyBrien replies:
| proposal - minimalistic core API for recognizers
| see: https://forth-standard.org/proposals/minimalistic-core-api-for-recognizers#reply-505
`------------------------------------------

```
: rectype-lit: ( xt -- )
  ['] noop swap dup >r
  :noname r@ compile, r> postpone literal postpone compile, postpone ;
  rectype: ;
```

not so straightforward, but possible.

,------------------------------------------
| 2020-09-07 14:03:37 ruv replies:
| proposal - Common terminology for recognizers discurse and specifications
| see: https://forth-standard.org/proposals/common-terminology-for-recognizers-discurse-and-specifications#reply-506
`------------------------------------------

Correction:

> But not with "rectype", that isn't an English word, and has inappropriate etymology.

Indeed, "rectype" is an alternative not to "token" but to "token descriptor" (or just "descriptor", when it is obvious from the context).

,------------------------------------------
| 2020-09-07 15:10:04 ruv replies:
| proposal - minimalistic core API for recognizers
| see: https://forth-standard.org/proposals/minimalistic-core-api-for-recognizers#reply-507
`------------------------------------------

### Previous works

In general, I like the approach of an active "rectype", i.e., one that you can execute to translate a token, so that a "rectype" is a token translator: `( i*x token -- j*x )`.

I described this approach in comp.lang.forth in 2018 (news:[pngvcc$pta$1@gioia.aioe.org](https://groups.google.com/forum/message/raw?msg=comp.lang.forth/8orqw1vjTOY/wMskqvDWCAAJ)).

Bernd [should also remember](https://groups.google.com/forum/message/raw?msg=comp.lang.forth/pN6PINgofvs/d4twEqyzAAAJ) the comparison of version D with the [Resolvers API](https://github.com/ruv/forth-design-exp/blob/master/docs/resolver-api.md), where I specified this approach and even provided [several POCs](https://github.com/ruv/forth-design-exp/tree/master/lexeme-translator/variation).

> and then define generic rectypes just like in Matthias Trute's version with rectype:

I also showed, just for illustration, a [hybrid variant](https://github.com/ruv/forth-design-exp/blob/master/lexeme-translator/variation/ttoken.rectype.fth), in which a "rectype" can be executed and can also be an argument of the accessors (and it is also compatible with version D, i.e. it is a "passive rectype", as JennyBrien mentioned [above](https://forth-standard.org/proposals/minimalistic-core-api-for-recognizers#reply-500)).

But the accessors from version D exclude some implementation approaches. Actually, these [accessors are useless](https://github.com/ForthHub/fep-recognizer/issues/6) when the higher-level methods are provided. Getting an xt and then executing this xt is an extra step without any benefit in most cases. Let's provide the corresponding methods instead of the accessors.

> This works with this method, but not with the previous way.

I'm not sure what you refer to, but "automatic postpone for literals" can be implemented in version D too.

```
: create-rectype-for-literal ( xt-compiler "name" -- )
  ['] noop swap dup rectype: ;
```

### Token translator

> Make the recognizer types executable to dispatch the methods (interpret, compile, postpone) themselves

> `RECTYPE-SOMETYPE ( i*x state -- j*x )`

By convention, the name for such a word should start with an English **verb**.

Concerning passing the state: in my [Resolvers API](https://github.com/ruv/forth-design-exp/blob/master/docs/resolver-api.md), the state is passed indirectly, i.e. not via the stack. This makes it easier to combine translators. E.g.:

```
: tt-3lit ( 3*x -- 3*x | ) >r tt-2lit r> tt-lit ;
```

VS

```
: tt-3lit-s ( 3*x state -- 3*x | ) dup >r swap >r tt-2lit-s r> r> tt-lit-s ;
```

Passing the state is cumbersome. Also, take into account that it is usually already kept in a variable anyway. Why do you need to pass it via the stack again and again? What is the rationale for passing it directly?

### Terminology

Please stop using confusing terminology such as "data type id" (in "The core principle is still that the recognizer is not aware of state, and the returned data type id is"). This terminology is not compatible with the language of the standard. I suggested the proper terminology before and have now published the [proposal](https://forth-standard.org/proposals/common-terminology-for-recognizers-discurse-and-specifications#contribution-161) on forth-standard.org; let's use it (and let's improve it where needed), or let's accurately define another terminology. The fact is that all the proposals about recognizers **can share the same terminology**.

Another example is the term "recognizer types". If a recognizer is a Forth definition having particular behavior, then "recognizer type" is the "type of a recognizer", that is, a type of a Forth definition, something like a [function type](https://en.wikipedia.org/wiki/Function_type).
But actually you mean a "token descriptor", that is, a "descriptor of a token", which tells something about the corresponding token and **tells nothing about the recognizers** (as Forth definitions).

,------------------------------------------
| 2020-09-07 15:58:29 ruv replies:
| proposal - minimalistic core API for recognizers
| see: https://forth-standard.org/proposals/minimalistic-core-api-for-recognizers#reply-508
`------------------------------------------

### Advantages

A huge advantage of this approach (but only when the state is passed indirectly) is that most user-defined token translators can be created far more easily than the corresponding descriptors ("rectypes"). You don't need to cope with three actions, and you don't need to cope with the state at all, since any token translator can be created via other, already defined translators!

,------------------------------------------
| 2020-09-07 16:50:59 ruv replies:
| proposal - Recognizer
| see: https://forth-standard.org/proposals/recognizer#reply-509
`------------------------------------------

> One advantage that the rectype-* names have over tokenclass-* is that the association with recognizers is more obvious.

But it should be associated with tokens, not with recognizers!

What is the etymology of "rectype"?

I see the following disadvantages of "rectype":

1. It's not an English word; it's an abbreviation, and it isn't explained.
2. "rec" is first associated with "record", which has nothing to do with recognizers.
3. "rectype" first suggests that it is a type of record, which is wrong. The second association is that it is a type of a recognizer, which is also wrong.
4. "rectype" describes a token, not a recognizer, but it refers to a recognizer. E.g., why is the entity that describes an "execution token" called "rectype of execution token"? (My suggestion is: "descriptor of execution token").

> not worth the costs of trying to find it, finding consensus on it, changing the existing code and documentation,

When we make a mistake, we pay for this mistake.

"rectype" is a mistake in the choice of a name. It seems that most of us (who work on the recognizer proposals) understand this, but did not want to find a better name at an earlier stage, and postponed this choice to a later stage. And now you say it is not worth changing this name. So we should have made this correction earlier. And now, I believe, we should pay the price, and this price is proportionate to the mistake.

Actually, the cost of changing the code is a weak argument. The internal code is not required to be updated (it's enough to make synonyms), and it can also be updated via auto-replacing.

The cost of changing the documentation is an even weaker argument. It can be updated via auto-replacing. But its terminology is wrong in any case, and this terminology should be fixed manually in far more places.

Concerning "trying to find it, finding consensus on it": we didn't even try. We have only two alternative suggestions (perhaps only one already). And nobody said that "rectype" by itself is better than "token descriptor" for our purpose.

,------------------------------------------
| 2020-09-07 17:12:08 BerndPaysan replies:
| proposal - minimalistic core API for recognizers
| see: https://forth-standard.org/proposals/minimalistic-core-api-for-recognizers#reply-510
`------------------------------------------

Yes, I proposed that kind of solution years ago. In effect, both ways have the same expressive power, but one does it by creation of noname words, the other by normal code. Acceptance may differ.

,------------------------------------------
| 2020-09-07 17:29:43 ruv replies:
| proposal - An alternative to the RECOGNIZER proposal
| see: https://forth-standard.org/proposals/an-alternative-to-the-recognizer-proposal#reply-511
`------------------------------------------

Re point 4, see the related discussions in comp.lang.forth, news:[pot8vf$811$1@gioia.aioe.org](https://groups.google.com/forum/message/raw?msg=comp.lang.forth/pN6PINgofvs/zOy-rEIXAwAJ) (and an [implementation](https://github.com/ruv/forth-design-exp/commit/178f4eab3dc693ca50c5fa0412de10161d617865#diff-a5aa33ef94d1587231d256585be78cd3)).

Combination with tick was discussed in news:[rd9sh6$r8v$1@dont-email.me](https://groups.google.com/forum/message/raw?msg=comp.lang.forth/yuNZEvq8EqA/W5ICEQlTAgAJ).

> What does `'a::b.c` mean, anyway?

It means `'(a::b.c)` (without parentheses), where `a::b.c` should be resolvable into an ordinary Forth word (i.e., one that has default interpretation semantics).

> Is it `('a)::(b.c)` or `'(a::b).c` or something else?

`('a)::(b.c)` does not make sense, since `'a` is resolved into a single-cell number.

`'(a::b).c` does not make sense for the same reason. Also, it should be a consequence of the specification for the tick prefix (the corresponding recognizer or resolver), i.e. that it resolves the rest of the lexeme into an ordinary Forth word.

,------------------------------------------
| 2020-09-07 17:41:42 ruv replies:
| proposal - An alternative to the RECOGNIZER proposal
| see: https://forth-standard.org/proposals/an-alternative-to-the-recognizer-proposal#reply-512
`------------------------------------------

> This proposal isn't supposed to do that.

The [Recognizer API](/proposals/recognizer#contribution-142) (actually, all the versions and rewordings), the [minimalistic core API for recognizers](https://forth-standard.org/proposals/minimalistic-core-api-for-recognizers#contribution-160), and my [Resolver API](https://github.com/ruv/forth-design-exp/blob/master/docs/resolver-api.md): all of them technically support the mentioned nesting, and they support the implementation of independent recognizers for `'X` and for `X::Y`, so that `'X::Y` will work automatically.

,------------------------------------------
| 2020-09-07 18:02:52 ruv replies:
| proposal - minimalistic core API for recognizers
| see: https://forth-standard.org/proposals/minimalistic-core-api-for-recognizers#reply-513
`------------------------------------------

@JennyBrien wrote

> Compare: [...]

> with:

```
: rectype: create , , , ;
:noname name>interpret execute ;
:noname name>compile execute ;
:noname name>compile swap lit, compile, ;
rectype: rectype-nt
```

(sic: the full postpone action).

This comparison is incorrect, since in the proposed API `rectype:` (which generates a token translator) can be defined as follows:

```
: rectype: ( xt-executer xt-compiler xt-postponer "name" -- )
  >r >r >r : ]] case
     0 of [[ r> xt, ]] endof
    -1 of [[ r> xt, ]] endof
    -2 of [[ r> xt, ]] endof
    -22 throw
  endcase [[ postpone ;
;
```

And you can use your same code to define your `rectype-nt` or anything else.
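The definition of `rectype:` above relies on `]]`, `[[` and `xt,`, which are not standard words. As a rough sketch of the same idea in standard Forth only, the defining word below stores the three xts with `CREATE` and dispatches on the state value in a `DOES>` part. It assumes the same state encoding as above (0 interpret, -1 compile, -2 postpone); `lit,`, `rectype-num`, `seven-lit` and `seven` are hypothetical names introduced just for this illustration.

```forth
\ Sketch: an executable "rectype" built with CREATE ... DOES>.
: rectype: ( xt-executer xt-compiler xt-postponer "name" -- )
  create rot , swap , ,            \ stored order: executer, compiler, postponer
  does> ( i*x state addr -- j*x )
    swap case
       0 of            @ execute endof   \ interpret
      -1 of cell+      @ execute endof   \ compile
      -2 of 2 cells +  @ execute endof   \ postpone
      -22 throw                          \ any other state value
    endcase ;

\ Helper: perform the compilation semantics of LITERAL.
: lit, ( x -- ) postpone literal ;

\ A translator for single-cell numbers, defined through the three actions.
:noname ( x -- x ) ;                        \ interpret: x stays on the stack
' lit,                                      \ compile: append x as a literal
:noname ( x -- ) lit, ['] lit, compile, ;   \ postpone: append code that appends x
rectype: rectype-num

\ Usage: an immediate word applies the compile action while compiling.
: seven-lit ( -- ) 7 -1 rectype-num ; immediate
: seven ( -- 7 ) seven-lit ;
```

Here `7 0 rectype-num` leaves 7 on the stack, while `seven-lit` compiles 7 as a literal into the definition being built. The table lookup costs one extra fetch compared with generated code, which is the usual trade-off of a `DOES>`-based variant.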