Digest #313 2025-08-25

Contributions

[408] 2025-08-24 15:44:45 agsb wrote:

proposal - An easy visual separator

Author:

Alvaro Gomes Sobral Barcellos

Change Log:

none

Problem:

In word definitions, component words form groups with specific purposes. These groups could be highlighted visually using something like bullet points.

Solution:

Use a pipe word, defined as : | ;

Typical use: (Optional)

: CATCH | SP@ >R | HANDLER @ >R | RP@ HANDLER ! | EXECUTE | R> HANDLER ! | R> DROP | 0 ;

Proposal:

The pipe ( | ) separates groups of words into blocks for easy visualization, and could serve as a placeholder for actions while debugging.

Reference implementation:

: | ;

Testing: (Optional)

Replies

[r1512] 2025-08-21 08:31:43 AntonErtl replies:

requestClarification - behavior on newline

The standard clearly specifies in 3.4.1 that the parsing of ." ends when the parse area (line or block) ends. So if that definition is on the command line or in a file, it is equivalent to

: a ." multiline?" s" char " ;

If that definition is in a single block (with 64-char "lines"), it is equivalent to

: a ." multiline?                                               s" [char] " ;

[r1513] 2025-08-21 08:40:17 AntonErtl replies:

requestClarification - Is behavior ambiguous if a name cannot be parsed?

The standard does not state that this is an ambiguous condition for this word. However, using [UNDEFINED] without a name is most likely a programming mistake rather than intentional, so I think this is something we might fix in the future if we ever get around to it.


[r1514] 2025-08-21 09:13:19 AntonErtl replies:

referenceImplementation - Suggested reference implementation

Using CMOVE (as it is implemented in different ways in many Forth systems) for FILL the way you do is actually very slow on modern CPUs. On a Ryzen 5800X it consumes 7 cycles per byte. See news:2021Sep1.233440@mips.complang.tuwien.ac.at, which also explains how to implement CMOVE more efficiently for this kind of usage (eventually I made a EuroForth 2021 presentation about this topic). But it is more efficient to implement FILL without calling CMOVE, and for the reference implementation, I think that the byte-storing variant is preferable; on fast Forth systems it is even faster than the CMOVE-based variant, by a lot.
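To make the byte-storing variant concrete, here is a minimal sketch along the lines Anton describes; this is my illustration, not his reference implementation, and it is written for clarity rather than speed.

```forth
\ Byte-storing FILL sketch: store char directly, one byte at a time,
\ instead of calling CMOVE.  Plain standard Forth; not speed-tuned.
: FILL ( c-addr u char -- )
  ROT ROT            \ char c-addr u
  OVER + SWAP        \ char limit start
  ?DO  DUP I C!  LOOP
  DROP ;
```

With u = 0 the ?DO loop body never runs, so the word is safe for empty ranges.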


[r1515] 2025-08-21 18:44:28 JimPeterson replies:

referenceImplementation - Possible Reference Implementation

I definitely agree that a reference implementation that relies on two's complement is an environmental dependency until the standard insists on two's complement.

I don't really see how the implementation I propose is insufficient for your system, unless 0<> does not return either 0 or -1. I see that TRUE is declared to return "a single-cell value with all bits set", so maybe that doesn't translate to -1 in your system?

Despite these points, I'm far more interested in the other question: should reference implementations proposed here focus more towards concisely conveying the intent of the word (like 0. 2SWAP D-), or should they attempt to offer reasonable implementations that can be used to fill out a nascent system (such as something that does not cyclically reference D-'s implementation as DNEGATE D+)?


[r1516] 2025-08-21 21:02:20 EricBlake replies:

referenceImplementation - Possible Reference Implementation

I don't really see how the implementation I propose is insufficient for your system, unless 0<> does not return either 0 or -1. I see that TRUE is declared to return "a single-cell value with all bits set", so maybe that doesn't translate to -1 in your system?

  • In a twos-complement system, a cell with all bits set is -1 when interpreted as a signed number.
  • In a ones-complement system, -1 is equal to 1 INVERT (so it has most bits set, but the least significant cleared); the value with all bits set is -0 (except that the standard points out that the usual arithmetic operations like + and * should never result in a -0; you can only get it as a flag or via bit manipulations).
  • In a sign-magnitude system, -1 is equal to 1 SIGN-BIT XOR (so it has only 2 bits set); the value with all bits set is the negative of the maximum positive signed value; this representation also has a -0, but that only has one bit set. Historical sign-magnitude machines and IEEE floating point use the most significant bit of storage as the sign bit, at least when converting the bit pattern to the corresponding unsigned value; but it is also possible to have sign-magnitude where the sign bit is adjacent to the least-significant bit.

In all three of those numeric encodings, the expressions 1 -1 * and 1 NEGATE will result in -1, but it is not usable as a canonical flag since the number of bits set differs between encodings. Meanwhile, a cell with all bits set treated as an unsigned integer would be the maximum unsigned integer, but only if the system does not use the standard's escape clause of capping the maximum u value at the same as the maximum d value rather than using the full cell.

(And there's the historical mess that older Forth standards allowed systems where TRUE was equal to 1 rather than all bits set, matching some common hardware setups and the C language.)

It is because the interpretation of a cell with all bits set has three different values in the three different encodings that Forth declares it to be ambiguous behavior if you ever perform math directly on a flag value; at most, you are guaranteed that you can perform bitwise logical operations on a flag value plus an integer to then result in an integer or zero (that is, a b = 1+ is ambiguous, but a b = -1 XOR 1+ is well-defined). In fact, because all bits set in ones complement is -0, the standard is careful to describe non-canonical flags as a cell with at least one bit set (rather than a cell with a non-zero value). Of course, when the next version of Forth requires twos-complement, it should also get rid of the ambiguity of doing math on a flag value, and we could do an editorial cleanup of all other places where the standard was dancing around concessions to ones-complement or sign-magnitude cell behavior.
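To illustrate the well-defined bitwise route described above, here is a small sketch; the word name FLAG>01 is hypothetical, not standard.

```forth
\ Convert a canonical flag (0 or all-bits-set) to 0 or 1 using only a
\ bitwise operation, which the text above notes is well-defined on flags.
\ In all three encodings, "all bits set" includes the least significant
\ bit, so 1 AND yields 1 for true and 0 for false.
: FLAG>01 ( flag -- 0|1 )  1 AND ;
```

Note that the -1 XOR 1+ idiom mentioned above inverts all bits only where -1 has all bits set (i.e. on twos-complement); 1 AND depends only on the least significant bit.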

On my particular system atop a VM with no fixed cell size, and for that matter, no native negative number support, I have found it easier to actually encode positive vs. negative by using a sign-magnitude with the least-significant bit as the sign bit, dispatching to the right variant of + or * based on the sign bit, then shifting that bit away before using the VM's native math on the resulting unsigned values. But when it comes to other operations, like XOR, I'm finding it easier to just declare that I convert a number between positive and negative as if by twos-complement modular arithmetic rules (even if that's not what is actually happening in the underlying bit representation). At which point, I'm finding it easier to declare that by fiat, -1 behaves as if it has all bits set (even though in the VM representation it may only have 2 bits set).

Despite these points, I'm far more interested in the other question: should reference implementations proposed here focus more towards concisely conveying the intent of the word (like 0. 2SWAP D-), or should they attempt to offer reasonable implementations that can be used to fill out a nascent system (such as something that does not cyclically reference D-'s implementation as DNEGATE D+)?

As a user, I prefer concise implementations that convey intent, even if it leads to circular references. As an implementer, I would prefer the extra leg up in having a topological sorting on how to build bigger pieces out of smaller building blocks, even if that leads to more verbosity or non-optimal implementations. I guess it boils down to deciding whether the standard is intended to be more user-friendly or more implementer-friendly; in my book, I think the balance swings in favor of users.


[r1517] 2025-08-21 21:58:38 EricBlake replies:

referenceImplementation - Possible reference implementation

This implementation requires SPACES to gracefully ignore negative input, although it has been questioned whether that is intended by the standard: https://forth-standard.org/standard/core/SPACES#contribution-337. Note that this implementation does not use S>D; doing so would produce the wrong results on a twos-complement machine for the minimum integer value.

: .R ( n1 n2 -- ) \ "dot-r"
  SWAP DUP ABS 0 <# #S ROT SIGN #> ( n2 c-addr u )
  ROT OVER - SPACES TYPE
;
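For illustration, the definition above right-justifies the number in a field of n2 characters, padding on the left with spaces:

```forth
\ Usage sketch for the .R definition above (assumes it is loaded).
123 8 .R    \ prints "     123" (five spaces, then the three digits)
-42 6 .R    \ prints "   -42" (sign counts toward the field width)
```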

[r1518] 2025-08-22 05:01:07 EricBlake replies:

proposal - minimalistic core API for recognizers

From r1412

SCAN-TRANSLATE-STRING ( addr1 u1 string-rest<"> -- addr2 u2 | ) RECOGNIZER EXT

Complete parsing a string: addr1 u1 consists of the starting quote and additional characters up to the first space in the string. addr2 u2 consists of the entire string, without the starting quote, up to (but not including) the final quote, with the escape sequences translated according to the rules of S\". >IN is modified appropriately and points just after the final quote. If there is no final quote in the current line, REFILL can be used to read in more lines, adding corresponding newlines to the string. The final quote can be inside addr1 u1, setting >IN backwards in that case.

...

TRANSLATE-STRING ( addr1 u1 -- addr1 u1 | ) RECOGNIZER EXT

Translate the string:

...

?SCAN-STRING ( addr1 u1 scan-translate-string string-rest<"> -- addr2 u2 translate-string | ... translator -- ... translator ) RECOGNIZER

If the recognized token is an incomplete string, complete the scanning as defined for SCAN-TRANSLATE-STRING and replace the translator with the xt of TRANSLATE-STRING.

REC-STRING ( addr u -- addr u translate-string | 0/NOTFOUND ) RECOGNIZER EXT

Check if addr u starts with a quote, and return that string and the xt of SCAN-TRANSLATE-STRING if it does, 0/NOTFOUND otherwise.

The listed stack effect for rec-string (addr u translate-string) disagrees with the explanatory text (which says scan-translate-string is returned on success). Also, why is ?SCAN-STRING grouped in RECOGNIZER rather than RECOGNIZER EXT?

Then in r1427

TRANSLATE-STRING ?SCAN-STRING

What are these words good for? REC-STRING apparently does not need them.

I'm playing around with implementing these in my Forth, and ran into the following. I can absolutely see why rec-string would need to return scan-translate-string instead of translate-string. The interpreter loop parses one space-separated lexeme at a time: if it sees "abc" as that lexeme, then rec-string can return translate-string with no issues on the 3-char sequence abc; but if it sees "a as the space-separated lexeme from an input buffer of "a b", rec-string absolutely must return scan-translate-string as the translator, so that the recognizer does not need any side effects beyond the one lexeme it is currently staring at (only translators should have side effects). So the proposal should be updated to call out the correct translator(s) returned by rec-string. (Gforth's implementation always returns scan-translate-string, but in my implementation, I had success returning the faster translate-string when the lexeme did not require a further scan or a backwards adjustment of >IN, for less work later; I'm trying to avoid repeated scanning in my implementation.)

Would it make sense to allow REC-STRING to have a stack effect ( addr1 u1 -- addr2 u2 scan-translate-string )? Put another way: must scan-translate-string start at the same leading " that rec-string recognized, with all \ sequences prior to the end of the lexeme still intact (implied when addr is the same on both sides of -- in the stack effect)? Or can I optimize by passing scan-translate-string the address of a transient buffer that contains the results after \ has been handled, to which it can then append the further results of \ handling on the scan to the closing "? In fact, if we want to allow the recognizer to (sometimes) return translate-string instead of always returning scan-translate-string on success, then it is imperative to allow a different addr on return than on entry: unlike scan-translate-string, the intent seems to be that ?scan-string locks down the parse so that a plain translate-string can then work in the future regardless of whether >IN has been modified in the meantime.

Another issue I hit: the rules for PARSE-NAME (or BL WORD COUNT) are clear that ALL control characters can be treated as a delimiter (not just a literal BL); but CHAR " PARSE does not get the same liberty. Depending on how lexemes were separated out prior to calling the recognizer, a choice of a different control character as whitespace (such as tab) is lost: the c-addr u passed to rec-string does NOT see which actual character was used as the delimiter, only that the lexeme was delimited, and >IN is already pointing past that whitespace. For the lexeme "abc", this does not matter (the concluding " was present in the lexeme, so the fact that there was subsequent whitespace doesn't affect the next invocation of parse-name to find the next lexeme); but for the input buffer s\" \"a b\"" evaluate vs. s\" \"a\tb\"" evaluate, the correct resulting translation should distinguish between space and tab, even though the delimiter was already eaten by the time the recognizer visits the initial lexeme "a. So an implementation of scan-translate-string that merely appends a space before calling CHAR " PARSE to pick up the rest of the input line to append to the existing string is not quite correct, when compared to an implementation that remembers the delimiter or rewinds >IN before resuming the scan (and then it's a matter of optimization whether it rewinds all the way back to the leading " that the recognizer saw, or whether it can merely rewind to the delimiter and then append to the already-parsed prefix). The proposal should probably be more explicit about the difference in whitespace handling when using PARSE-NAME to separate out initial lexemes to hand to a recognizer, vs. PARSE used to find the final " to end an incomplete string.

The other thing I'm wondering is whether there is any possible way to avoid needing to set >IN backwards as part of scan-translate-string (the standard is clear that in a full environment, >IN can be set backwards; but in a restricted environment, being able to do work with only forward progress can be more efficient). Would it make sense to allow rec-string to sometimes have a stack effect of ( addr1 u1 -- ... translator addr2 u2 translate-string translate-compound ), where translate-compound has a stack effect ( ...2 translator2 ...1 translator1 -- ...1 ...2 ) of performing the semantics of translator1 followed by translator2 (including the case where a nested translate-compound could be one of those embedded translators)? That way, rec-string on the lexeme "a"1 could determine that the string is completely parsed within addr u, AND run a nested recognizer on the suffix of the current lexeme, finally returning 1 translate-num addr2 u2 translate-string translate-compound, and thereby avoiding the need to rewind >IN?

Is it okay that rec-string defers to scan-translate-string even if the lexeme does not contain a well-formed string? For example, S\" a\xQ" is ambiguous because Q does not satisfy the requirement that \x be followed by two hex digits; but if rec-string on the lexeme "a\xQ" returns a translator rather than 0, then any recognizer installed later in the sequence (one that might understand a different set of \ escape characters) will not get a shot at the lexeme, because rec-string already claimed it even though scan-translate-string will trigger ambiguous behavior. Answering this determines whether rec-string can be simple (merely looking for a first character of ", without regard to the rest of the string) or must do \ processing. (The fact that the standard says \xQ triggers ambiguous behavior but does not have a dedicated throw code for it is a matter for a different day.)
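For reference, the "simple" variant of rec-string discussed above might look like the following sketch; it claims any lexeme starting with a quote and defers all \ processing (including ambiguous cases like \xQ) to the translator. SCAN-TRANSLATE-STRING is assumed to exist as described in the proposal.

```forth
\ Minimal rec-string sketch: only inspect the first character.
\ Returns the lexeme and the xt of SCAN-TRANSLATE-STRING on a match,
\ 0 otherwise.  Assumes SCAN-TRANSLATE-STRING per the proposal.
: REC-STRING ( addr u -- addr u xt | 0 )
  DUP 0= IF  2DROP 0 EXIT  THEN
  OVER C@ [CHAR] " = IF  ['] SCAN-TRANSLATE-STRING
                   ELSE  2DROP 0  THEN ;
```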