Digest #140 2021-04-05
LEAVE be implemented in such a way that its compilation semantics have the stack effect ( C: do-sys1 i*x -- do-sys2 i*x )?
LEAVE is implemented in this way in my DO LOOP over BEGIN UNTIL proof of concept. This PoC also relies on a variable size of data objects of the do-sys data type, but LEAVE can also be implemented in such a way that the size of do-sys1 is equal to the size of
I don't think this proposal works for extended characters. While $a4 emit works for ä, this explicitly doesn't work (if my understanding is correct) for Unicode code points that are multi-byte in UTF-8. You have to know, at the point of emitting, what the expected encoding is.
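To make the multi-byte case concrete, here is a minimal sketch of how a code point could be split into UTF-8 bytes before they are sent one at a time. The names utf8-buf, cont-byte and >utf8 are made up for this illustration and are not standard words. The sketch assumes cells of at least 32 bits, 8-bit characters and the $ hex prefix of Forth 2012, and it does not reject surrogates or values above $10FFFF.

```forth
\ Sketch only: hypothetical helper words, not part of any standard word set.
create utf8-buf 4 chars allot          \ scratch space for one encoded code point

: cont-byte ( xc shift -- byte )       \ build a continuation byte 10xxxxxx
  rshift $3F and $80 or ;

: >utf8 ( xc c-addr -- u )             \ store xc as UTF-8 at c-addr, return byte count
  over $80 u< if  c!  1 exit  then     \ 0xxxxxxx: plain ASCII
  over $800 u< if                      \ 110xxxxx 10xxxxxx
    over 6 rshift $C0 or over c!
    swap 0 cont-byte swap char+ c!  2 exit
  then
  over $10000 u< if                    \ 1110xxxx 10xxxxxx 10xxxxxx
    over 12 rshift $E0 or over c!
    over 6 cont-byte over char+ c!
    swap 0 cont-byte swap 2 chars + c!  3 exit
  then
  over 18 rshift $F0 or over c!        \ 11110xxx and three continuation bytes
  over 12 cont-byte over char+ c!
  over 6 cont-byte over 2 chars + c!
  swap 0 cont-byte swap 3 chars + c!  4 ;
```

With these definitions, $E4 utf8-buf >utf8 leaves 2 on the stack and stores the bytes C3 A4 in utf8-buf, so utf8-buf swap type would send both bytes in one call.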
If we define it as UTF-8, then EMIT can know that the byte is part of a multi-byte character and hold it until it gets the next byte before passing the result to the operating system. At the moment, though, I don't believe that Forth 2012 is defined as UTF-8, so a conformant system would have to emit that first byte (which I think will have its top bit set) as a character.
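The "hold it until the next byte arrives" idea could look roughly like the sketch below. It is only an illustration of the mechanism, not what any existing system does: the words pending, #pending, #expected, seq-length, flush-pending and held-emit are invented here, 8-bit characters are assumed, and continuation bytes are not validated.

```forth
\ Sketch only: hypothetical words showing an EMIT-like word that holds
\ incomplete UTF-8 sequences. Continuation bytes are not checked.
create pending 4 chars allot     \ bytes collected so far
variable #pending                \ number of bytes currently held
variable #expected               \ sequence length announced by the lead byte
0 #pending !  0 #expected !

: seq-length ( lead-byte -- n )  \ length of the sequence this byte starts
  dup $80 u< if  drop 1 exit  then
  dup $E0 and $C0 = if  drop 2 exit  then
  dup $F0 and $E0 = if  drop 3 exit  then
  $F8 and $F0 = if  4 exit  then
  1 ;                            \ stray or invalid byte: pass it through alone

: flush-pending ( -- )           \ send the held bytes in a single TYPE
  pending #pending @ type  0 #pending ! ;

: held-emit ( byte -- )
  #pending @ 0= if  dup seq-length #expected !  then
  pending #pending @ chars + c!  \ append the byte to the held sequence
  1 #pending +!
  #pending @ #expected @ = if  flush-pending  then ;
```

After $C3 held-emit nothing has been sent yet and one byte is held; the following $A4 held-emit completes the sequence and passes both bytes on together.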
I'm not suggesting that what I've done is the right solution, but I think any proposal needs to be reviewed by someone with a better understanding (than me, or the proposer of this) of how Unicode and UTF-8 work before changes are made.
The wording and stack effect of EMIT suggest that EMIT should deal with (possibly extended) characters rather than raw bytes.
Yes, but 18.3.5 says: "IO words such as EMIT, [...] operate on pchars. Therefore, it is possible that these words read or write incomplete xchars, which are completed in the next consecutive operation(s)".
So if a system implements the optional Extended-Character word set, then the argument of EMIT shall be a pchar. I.e., it suggests that in such a case EMIT cannot deal with extended characters (code points) that are not pchars. The problem is that if EMIT accepts code points that are not pchars, then it cannot treat several calls with pchars as a single xchar, and vice versa, since pchar is a subset of xchar.
EMIT ( char -- )
Send char as raw byte to the user output device.
I would suggest relying on primitive characters instead of bytes (since the Standard does not actually use the notion of a "byte"):
EMIT ( pchar -- )
Send the primitive character pchar to the user output device.
Perhaps the primitive character data type (pchar) should be included in Table 3.1, since it's used in 184.108.40.206 and in A.6.1.1750. OTOH, it's not clear what the particular differences between the "primitive character" (pchar) and "character" (char) data types are.
I tested SP-Forth/4 in Windows (by setting the UTF-8 code page in the console via the chcp 65001 command), and in Linux.
HEX C3 EMIT A4 EMIT
In SP-Forth the word EMIT is implemented via TYPE (that is via
In the test
HEX C3 EMIT KEY DROP A4 EMIT
we can see that after the first emit nothing is shown, and after the second emit the character ä is shown.
So the possible options to support extended characters beyond pchar are:
- Define pchar size as 16 bits (an easy case), and use UTF-16. Also, use address units of 16 bits, or address units of 8 bits with environmental restrictions.
- Define pchar size as 8 bits and handle UTF-8 by yourself (to convert it into UTF-16).
Another option is to not support extended characters beyond pchar at all. In this case pchar size can be 8, 16, or 32 bits.
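For the second option in the list above, one part of "handling UTF-8 by yourself" is turning an already decoded code point into UTF-16 code units. The sketch below shows only that step, under the assumption of cells of at least 32 bits; the word >utf16 is a made-up name, decoding the incoming UTF-8 bytes is left out, and surrogate values passed in as input are not rejected.

```forth
\ Sketch only: hypothetical word, code point to UTF-16 code units.
: >utf16 ( xc -- u 1 | hi lo 2 )
  dup $10000 u< if  1 exit  then   \ BMP code point: one 16-bit unit
  $10000 -                         \ 20-bit offset into the supplementary planes
  dup 10 rshift $D800 or           \ high surrogate from the top 10 bits
  swap $3FF and $DC00 or           \ low surrogate from the bottom 10 bits
  2 ;
```

For example, $E4 >utf16 leaves $E4 1, while $1F600 >utf16 leaves $D83D $DE00 2.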