Digest #140 2021-04-05

Contributions

[185] 2021-04-04 10:35:03 ruv wrote:

requestClarification - Stack effect of LEAVE during compilation

May LEAVE be implemented in such a way that its compilation semantics have stack effect ( C: do-sys1 i*x -- do-sys2 i*x )?

For illustration, LEAVE is implemented in this way in my DO LOOP over BEGIN UNTIL proof of concept. This PoC also relies on data objects of the do-sys data type having variable size, but LEAVE can also be implemented in such a way that the size of do-sys1 is equal to the size of do-sys2.

Replies

[r626] 2021-04-04 01:53:23 MitraArdron replies:

proposal - EMIT and non-ASCII values

I don't think this proposal works for extended characters. While $a4 emit works for ä, this explicitly doesn't work (if my understanding is correct) for Unicode code points that take multiple bytes in UTF8. You have to know, at the point of emitting, what the expected encoding is.

If we define it as UTF8 then EMIT can know that the byte is part of a multi-byte character, and hold it until it gets the next byte before passing to the operating system, but at the moment I don't believe that Forth 2012 is defined as UTF8, so a conformant system would have to emit that first byte (which I think will have its top bit set) as a character.
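
The "hold it until it gets the next byte" behaviour can be sketched in JavaScript, the language of the webForth port discussed in this thread. This is only an illustration, not webForth's actual code: makeEmit and the write callback are hypothetical names, and the sketch assumes an environment that provides the standard TextDecoder (browsers, recent Node.js).

```javascript
// Hypothetical sketch: an EMIT-like function that withholds incomplete
// UTF-8 sequences and forwards text only once a character is complete.
function makeEmit(write) {
  const decoder = new TextDecoder("utf-8");
  return function emit(byte) {
    // With { stream: true } the decoder keeps an incomplete trailing
    // sequence internally and returns an empty string for it.
    const text = decoder.decode(Uint8Array.of(byte & 0xff), { stream: true });
    if (text.length > 0) write(text);
  };
}

const out = [];
const emit = makeEmit((s) => out.push(s));

emit(0xc3);                    // lead byte of a two-byte sequence: held back
const afterFirst = out.length; // nothing has been written yet
emit(0xa4);                    // continuation byte completes U+00E4
```

After the second call, out holds the single string "ä". A system that does not declare its output as UTF8 could not buffer like this and would have to pass the first byte through as a character of its own.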

For webForth in C (on Arduino) I feed the characters to Serial.write, which (I think) treats them as UTF8. For webForth in JavaScript I flip it around: the base primitive is TX!S, which puts out a string. TYPE calls this directly, and EMIT passes a 1 character string. TX!S just passes it to the JavaScript side, which is string oriented, and I define the stream as UTF8 encoded at initialization. This is also a LOT faster than passing characters individually to a string oriented system anyway.

I'm not suggesting what I've done is the right solution - but I think any proposal needs to be reviewed by someone with a better understanding (than me, or the proposer of this) of how Unicode and UTF8 work before changes are made.


[r627] 2021-04-04 15:15:25 ruv replies:

proposal - EMIT and non-ASCII values

the wording and stack effect of EMIT suggests that EMIT should deal with (possibly extended) characters rather than raw bytes.

Yes, but 18.3.5 says: "IO words such as KEY, EMIT, [...] operate on pchars. Therefore, it is possible that these words read or write incomplete xchars, which are completed in the next consecutive operation(s)".

So if a system implements the optional Extended-Character word set, then the argument of EMIT shall be a pchar. That is, in such a case EMIT cannot deal with extended characters (code points) that are not pchars. The problem is that if EMIT accepts code points that are not pchars, then it cannot treat several calls with pchars as a single xchar, and vice versa, since pchar is a subset of xchar.

EMIT ( char -- )
Send char as raw byte to the user output device.

I would suggest relying on primitive characters instead of bytes (since the Standard does not actually use the notion of "byte"):

EMIT ( pchar -- )
Send the primitive character pchar to the user output device.

Perhaps the primitive character data type (pchar) should be included in Table 3.1, since it's used in 3.1.2.3 and in A.6.1.1750. OTOH it's not clear what the particular differences between the "primitive character" (pchar) and "character" (char) data types are.

Existing practice

I tested SP-Forth/4 in Windows (after setting the UTF-8 code page in the console via the chcp 65001 command), and in Linux. The test:

HEX  C3 EMIT A4 EMIT

outputs ä

In SP-Forth the word EMIT is implemented via TYPE (that is via WRITE-FILE).

In the test

HEX  C3 EMIT  KEY DROP  A4 EMIT

we can see that after the first emit nothing is shown, and after the second emit the character ä is shown.
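
The byte pair C3 A4 used in both tests is the UTF-8 encoding of the code point U+00E4 (ä). As a rough sketch of how a code point maps onto a sequence of 8-bit pchars that are then emitted one at a time, here is a small JavaScript illustration. The names utf8Bytes and emitCodePoint are hypothetical, and the sketch assumes a valid code point in the range 0 to 10FFFF hex without checking it.

```javascript
// Hypothetical sketch: split one code point into UTF-8 bytes (8-bit pchars).
// Assumes 0 <= codePoint <= 0x10FFFF; no validation is performed.
function utf8Bytes(codePoint) {
  if (codePoint < 0x80) return [codePoint];
  if (codePoint < 0x800) {
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint < 0x10000) {
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

// Send one extended character as several pchar-sized emit calls.
function emitCodePoint(codePoint, emit) {
  for (const b of utf8Bytes(codePoint)) emit(b);
}

const sent = [];
emitCodePoint(0xe4, (b) => sent.push(b)); // sent is now [0xc3, 0xa4]
```

Under this reading EMIT itself never sees a code point beyond pchar. It only ever receives the individual bytes, which matches the behaviour observed in the tests.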


[r628] 2021-04-04 15:57:41 ruv replies:

proposal - EMIT and non-ASCII values

but for webForth in Javascript I flip it around and the base primitive is TX!S which puts out a string - TYPE calls this directly, and EMIT passes a 1 character string, TX!S just passes it to the Javascript which is string oriented - I define the stream as UTF8 encoded at initialization.

Regardless of the encoding of the source code files, JavaScript uses UTF-16 for strings. Setting UTF-8 only affects the encoding of the source code file or stream.

So the possible options to support extended characters beyond pchar are:

  1. Define pchar size as 16 bits (an easy case), and use UTF-16. Also, use address units of 16 bits, or address units of 8 bits with environmental restrictions.
  2. Define pchar size as 8 bits and handle UTF-8 by yourself (to convert it into UTF-16).
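
For option 2, a hand-written conversion from a sequence of 8-bit pchars to a JavaScript (UTF-16) string might look roughly like the following sketch. The function name utf8ToString is hypothetical. The sketch rejects malformed lead and continuation bytes, but for brevity it does not reject overlong encodings or encoded surrogates, which a production decoder should.

```javascript
// Hypothetical sketch: decode an array of UTF-8 bytes into a JavaScript string.
// Not a complete validator: overlong forms and encoded surrogates are accepted.
function utf8ToString(bytes) {
  let result = "";
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let extra; // number of continuation bytes expected
    let cp;    // code point being assembled
    if (lead < 0x80) { extra = 0; cp = lead; }
    else if ((lead & 0xe0) === 0xc0) { extra = 1; cp = lead & 0x1f; }
    else if ((lead & 0xf0) === 0xe0) { extra = 2; cp = lead & 0x0f; }
    else if ((lead & 0xf8) === 0xf0) { extra = 3; cp = lead & 0x07; }
    else throw new RangeError("invalid UTF-8 lead byte at index " + i);

    for (let k = 1; k <= extra; k++) {
      const cont = bytes[i + k]; // undefined if the input is truncated
      if ((cont & 0xc0) !== 0x80) {
        throw new RangeError("invalid or missing continuation byte at index " + (i + k));
      }
      cp = (cp << 6) | (cont & 0x3f);
    }
    // fromCodePoint produces a surrogate pair for code points above 0xFFFF.
    result += String.fromCodePoint(cp);
    i += extra + 1;
  }
  return result;
}

const aUmlaut = utf8ToString([0xc3, 0xa4]);           // one UTF-16 code unit
const emoji = utf8ToString([0xf0, 0x9f, 0x98, 0x80]); // two UTF-16 code units
```

The second result has a length of 2 in JavaScript because U+1F600 lies outside the Basic Multilingual Plane, which is the kind of mismatch between pchar, xchar, and host string units that this option has to manage.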

Another option is to not support extended characters beyond pchar at all. In this case pchar size can be 8, 16, or 32 bits.