,---------------.
| Contributions |
`---------------´
,------------------------------------------
| 2021-04-04 10:35:03 ruv wrote:
| requestClarification - Stack effect of LEAVE during compilation
| see: https://forth-standard.org/standard/core/LEAVE#contribution-185
`------------------------------------------
May `LEAVE` be implemented in such a way that its compilation semantics have the stack effect `( C: do-sys1 i*x -- do-sys2 i*x )`?
For illustration, `LEAVE` is implemented in this way in my [DO LOOP over BEGIN UNTIL](https://gist.github.com/ruv/0b0bfdbe2759254a5318d76f9b05d262) proof of concept. This PoC also relies on the [variable size](/standard/usage#contribution-183) of data objects of the `do-sys` data type, but `LEAVE` can also be implemented in such a way that the size of `do-sys1` is equal to the size of `do-sys2`.
,---------.
| Replies |
`---------´
,------------------------------------------
| 2021-04-04 01:53:23 MitraArdron replies:
| proposal - EMIT and non-ASCII values
| see: https://forth-standard.org/proposals/emit-and-non-ascii-values#reply-626
`------------------------------------------
I don't think this proposal works for extended characters. While `$a4 emit` works for `ä`, this explicitly doesn't work (if my understanding is correct) for Unicode code points that are multi-byte in UTF-8. You have to know, at the point of emitting, what the expected encoding is.
If we define it as UTF-8, then EMIT can know that a byte is part of a multi-byte character and hold it until it gets the next byte before passing it to the operating system. But at the moment I don't believe that Forth 2012 is defined as UTF-8, so a conformant system would have to emit that first byte (which I think will have its top bit set) as a character.
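As a rough illustration of the "hold it until it gets the next byte" idea, here is a minimal Python sketch (not Forth, and not taken from any existing system). The names `utf8_sequence_length` and `BufferingEmit`, and the `sink` list standing in for the operating system, are hypothetical. It assumes the output stream is UTF-8 and does not validate continuation bytes.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the total length of a UTF-8 sequence given its first byte.

    Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
    """
    if lead < 0x80:
        return 1              # 0xxxxxxx: ASCII, single byte
    if 0xC2 <= lead <= 0xDF:
        return 2              # 110xxxxx: two-byte sequence
    if 0xE0 <= lead <= 0xEF:
        return 3              # 1110xxxx: three-byte sequence
    if 0xF0 <= lead <= 0xF4:
        return 4              # 11110xxx: four-byte sequence
    return 0


class BufferingEmit:
    """Hypothetical EMIT that holds bytes until a full UTF-8 sequence arrives."""

    def __init__(self):
        self.pending = bytearray()  # bytes of the sequence seen so far
        self.expected = 0           # total length of the current sequence
        self.sink = []              # stands in for the operating system's output

    def emit(self, byte: int) -> None:
        if not self.pending:
            self.expected = utf8_sequence_length(byte)
            if self.expected == 0:
                # Stray continuation or invalid lead byte: pass it through as is.
                self.sink.append(bytes([byte]))
                return
        self.pending.append(byte)
        if len(self.pending) == self.expected:
            # The sequence is complete, so hand it over in one piece.
            self.sink.append(bytes(self.pending))
            self.pending.clear()
```

With this sketch, emitting `0xC3` leaves `sink` empty, and emitting `0xA4` afterwards delivers the two bytes together. A system that is not defined as UTF-8 has no basis for this buffering, which is the concern raised above.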
For webForth in C (on Arduino) I feed the characters to Serial.write which (I think) treats it as UTF8, but for webForth in Javascript I flip it around and the base primitive is TX!S which puts out a string - TYPE calls this directly, and EMIT passes a 1 character string, TX!S just passes it to the Javascript which is string oriented - I define the stream as UTF8 encoded at initialization. This is also a LOT faster than passing characters individually to a string oriented system anyway.
I'm not suggesting what I've done is the right solution, but I think any proposal needs to be reviewed by someone with a better understanding (than me, or the proposer of this) of how Unicode and UTF-8 work before changes are made.
,------------------------------------------
| 2021-04-04 15:15:25 ruv replies:
| proposal - EMIT and non-ASCII values
| see: https://forth-standard.org/proposals/emit-and-non-ascii-values#reply-627
`------------------------------------------
> the wording and stack effect of EMIT suggests that EMIT should deal with (possibly extended) characters rather than raw bytes.
Yes, but 18.3.5 says: "IO words such as `KEY`, `EMIT`, [...] operate on _pchars_. Therefore, it is possible that these words read or write incomplete xchars, which are completed in the next consecutive operation(s)".
So if a system implements the optional Extended-Character word set, then the argument of `EMIT` shall be a pchar. I.e., it suggests that in such a case `EMIT` cannot deal with extended characters (code points) that are not pchars. The problem is that if `EMIT` accepts code points that are not pchars, then it cannot handle several calls with a pchar as a single xchar, and vice versa, since pchar is a subset of xchar.
> EMIT ( char -- )
> Send char as raw byte to the user output device.
I would suggest relying on primitive characters instead of bytes (since the Standard does not actually use the notion of a "byte"):
**EMIT ( pchar -- )
Send the primitive character _pchar_ to the user output device.**
Perhaps the primitive character data type (_pchar_) should be included in [Table 3.1](/standard/usage#table:datatypes), since it's used in [3.1.2.3](/standard/usage#subsubsection.3.1.2.3) and in [A.6.1.1750](/standard/rationale#rat:core:KEY). OTOH it's not clear what the particular differences are between the "primitive character" (_pchar_) and "character" (_char_) data types.
> Existing practice
I tested SP-Forth/4 in Windows (after setting the UTF-8 code page in the console via the `chcp 65001` command), and in Linux.
The test:
```
HEX C3 EMIT A4 EMIT
```
outputs __ä__
In SP-Forth the word `EMIT` is implemented via `TYPE` (that is, via `WRITE-FILE`).
In the test
```
HEX C3 EMIT KEY DROP A4 EMIT
```
we can see that after the first emit nothing is shown, and after the second emit the character __ä__ is shown.
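The observed behaviour is what one would expect if `EMIT` passes each pchar through unchanged and the console itself decodes the byte stream as UTF-8. A small Python model of the receiving side, using the standard library's incremental decoder (an illustration only, not how SP-Forth or any console is actually implemented):

```python
import codecs

# Model of a console that decodes incoming bytes as UTF-8.
decoder = codecs.getincrementaldecoder("utf-8")()

# First EMIT: only the lead byte C3 arrives, so nothing can be shown yet.
shown_after_first = decoder.decode(bytes([0xC3]))

# Second EMIT: the continuation byte A4 completes the character.
shown_after_second = decoder.decode(bytes([0xA4]))

print(repr(shown_after_first))   # ''
print(repr(shown_after_second))  # 'ä'
```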
,------------------------------------------
| 2021-04-04 15:57:41 ruv replies:
| proposal - EMIT and non-ASCII values
| see: https://forth-standard.org/proposals/emit-and-non-ascii-values#reply-628
`------------------------------------------
> but for webForth in Javascript I flip it around and the base primitive is TX!S which puts out a string - TYPE calls this directly, and EMIT passes a 1 character string, TX!S just passes it to the Javascript which is string oriented - I define the stream as UTF8 encoded at initialization.
Regardless of the encoding in the source code files, JavaScript [uses](https://flaviocopes.com/javascript-unicode/#how-javascript-uses-unicode-internally) UTF-16 encoding for strings. Setting UTF-8 affects only the encoding of the source code file/stream.
So the possible options to support extended characters beyond pchar are:
1. Define pchar size as 16 bits (an easy case), and use UTF-16. Also, use address units of 16 bits, or address units of 8 bits with environmental restrictions.
2. Define pchar size as 8 bits and handle UTF-8 by yourself (to convert it into UTF-16).
Another option is to not support extended characters beyond pchar at all. In this case pchar size can be 8, 16, or 32 bits.
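To make the difference between the options concrete, the following Python sketch counts how many 8-bit and 16-bit units a character needs, and converts a UTF-8 byte sequence into UTF-16 code units as option 2 would have to do. The helper names `code_units` and `utf8_to_utf16_units` are made up for this illustration. Note that even with 16-bit pchars and UTF-16, characters outside the Basic Multilingual Plane still occupy two units (a surrogate pair).

```python
def code_units(ch: str) -> dict:
    """Count the code units needed for one character in each encoding."""
    return {
        "utf8_units_8bit": len(ch.encode("utf-8")),
        "utf16_units_16bit": len(ch.encode("utf-16-le")) // 2,
    }


def utf8_to_utf16_units(data: bytes) -> list:
    """Decode UTF-8 bytes and return the equivalent UTF-16 code units."""
    encoded = data.decode("utf-8").encode("utf-16-le")
    return [
        int.from_bytes(encoded[i:i + 2], "little")
        for i in range(0, len(encoded), 2)
    ]


print(code_units("\u00e4"))      # ä: 2 units in UTF-8, 1 unit in UTF-16
print(code_units("\U0001F600"))  # outside the BMP: 4 units in UTF-8, 2 in UTF-16
print([hex(u) for u in utf8_to_utf16_units(b"\xc3\xa4")])  # ['0xe4']
```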