Proposal: [184] EMIT and non-ASCII values

Formal

AntonErtl: [184] EMIT and non-ASCII values (Proposal, 2021-04-03 15:34:40)

Author:

Anton Ertl

Change Log:

2021-04-03 Original proposal

Problem:

The first ideas for the xchar wordset had EMIT behave like (current) XEMIT. Then Stephen Pelc pointed out that EMIT is used in a number of programs for dealing with raw bytes, so we introduced XEMIT for dealing with extended characters. But the wording and stack effect of EMIT suggest that EMIT should deal with (possibly extended) characters rather than raw bytes. That reading is at odds with a number of implementations, and under it there would be hardly any reason to keep both EMIT and XEMIT.

Solution:

Define EMIT to deal with raw bytes.

I leave a similar proposal for KEY to interested parties.

Typical use: (Optional)

$c3 emit $a4 emit \ outputs ä on a UTF-8 system

Proposal:

Change the definition of EMIT into:

EMIT ( char -- )

Send char as raw byte to the user output device.

Rationale:

EMIT supports low-level communication of arbitrary contents, not limited to specific encodings; it corresponds to TYPEing one char/byte. To print multi-byte extended characters, the straightforward way is to use TYPE or XEMIT, but you can also print the individual bytes with multiple EMITs.

Reference implementation:

create emit-buf 1 allot

: emit ( char -- )
  emit-buf c! emit-buf 1 type ;

Existing practice

Gforth, SwiftForth, and VFX implement EMIT as dealing with raw bytes (tested with the "typical use" above), but Peter Fälth's system implements EMIT as an alias of XEMIT, and iForth prints two funny characters. It is unclear if there are any existing programs affected by the proposed change.

Testing:

This cannot be tested from a standard program, because there is no way to inspect the output of EMIT.

MitraArdron

I don't think this proposal works for extended characters. While $a4 emit works for ä this explicitly doesn't work (if my understanding is correct) for unicode positions that are multicharacter in UTF8. You have to know at the point of emitting, what the expected coding is.

If we define it as UTF8 then EMIT can know that the byte is part of a multi-byte character, and hold it until it gets the next byte before passing to the operating system, but at the moment I don't believe that Forth 2012 is defined as UTF8, so a conformant system would have to emit that first byte (which I think will have its top bit set) as a character.

For webForth in C (on Arduino) I feed the characters to Serial.write which (I think) treats it as UTF8, but for webForth in Javascript I flip it around and the base primitive is TX!S which puts out a string - TYPE calls this directly, and EMIT passes a 1 character string, TX!S just passes it to the Javascript which is string oriented - I define the stream as UTF8 encoded at initialization. This is also a LOT faster than passing characters individually to a string oriented system anyway.

I'm not suggesting what I've done is the right solution, but I think any proposal needs to be reviewed by someone with a better understanding (than me, or the proposer of this) of how Unicode and UTF8 work before changes are made.

ruv

"the wording and stack effect of EMIT suggest that EMIT should deal with (possibly extended) characters rather than raw bytes."

Yes, but 18.3.5 says: "IO words such as KEY, EMIT, [...] operate on pchars. Therefore, it is possible that these words read or write incomplete xchars, which are completed in the next consecutive operation(s)".

So if a system implements the optional Extended-Character word set, then the argument of EMIT shall be a pchar. That is, in such a case EMIT cannot deal with extended characters (code points) that are not pchars. The problem is that if EMIT accepts code points that are not pchars, then it cannot treat several calls with pchars as a single xchar, and vice versa, since pchar is a subset of xchar.
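To make the ambiguity concrete, here is a minimal sketch of the two readings. It assumes 8-bit chars, a UTF-8 terminal, and that the optional xchar word set is present; what a particular system actually displays remains implementation-dependent.

```forth
\ Sketch only: assumes 8-bit chars, UTF-8 output, and the xchar word set.
$c3 emit  $a4 emit    \ two pchars that together encode one xchar, U+00E4 (ä)
$c3 xemit $a4 xemit   \ two separate xchars: U+00C3 (Ã), then U+00A4 (¤)
```

If EMIT also accepted arbitrary code points, the first line could not be told apart from the second, which is the conflict described above.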

EMIT ( char -- )
Send char as raw byte to the user output device.

I would suggest relying on primitive characters instead of bytes (since the Standard does not actually use the notion of "byte"):

EMIT ( pchar -- )
Send the primitive character pchar to the user output device.

Perhaps the primitive character data type (pchar) should be included in Table 3.1, since it's used in 3.1.2.3 and in A.6.1.1750. OTOH, it's not clear what the particular differences are between the "primitive character" (pchar) and "character" (char) data types.

Existing practice

I tested SP-Forth/4 in Windows (by setting UTF-8 code page in the console via chcp 65001 command), and in Linux. The test:

HEX  C3 EMIT A4 EMIT

outputs ä

In SP-Forth the word EMIT is implemented via TYPE (that is via WRITE-FILE).

In the test

HEX  C3 EMIT  KEY DROP  A4 EMIT

we can see that after the first emit nothing is shown, and after the second emit the character ä is shown.

ruv

"but for webForth in Javascript I flip it around and the base primitive is TX!S which puts out a string - TYPE calls this directly, and EMIT passes a 1 character string, TX!S just passes it to the Javascript which is string oriented - I define the stream as UTF8 encoded at initialization."

Regardless of the encoding in the source code files, JavaScript uses UTF-16 encoding for strings. When you set UTF-8, it only concerns the encoding of the source code file/stream.

So the possible options to support extended characters beyond pchar are:

  1. Define pchar size as 16 bits (an easy case), and use UTF-16. Also, use address units of 16 bits, or address units of 8 bits with environmental restrictions.
  2. Define pchar size as 8 bits and handle UTF-8 by yourself (to convert it into UTF-16).

Another option is to not support extended characters beyond pchar at all. In this case pchar size can be 8, 16, or 32 bits.

MitraArdron

It would seem like a mistake to me to use 16 bit strings in Forth. Pretty much everything else seems to use UTF8, and it makes the transition from existing ASCII MUCH easier, since the default case is to treat the string exactly as before.

As an example, for the JavaScript case in node, I use process.stdout.setEncoding('utf8'); to tell it that the output is UTF8, then TextDecoder().decode(this.buff8(byteStart, byteEnd - byteStart)) to turn the Forth string into a JavaScript string, then process.stdout.write(s) to output that string. This has the advantage of working just the same whether the string is old-style Forth (1 byte per ASCII character) or UTF8.

If we used 16 bit strings, we'd need versions of all the string words - S", abort" etc for 16 bit rather than just fixing the output routines NOT to convert strings back to characters.

ruv

It seems that, to support characters beyond pchar in a JavaScript-based Forth implementation, we need to introduce our own buffer before output to ensure that only completed xchars are passed to JS. This holds regardless of whether UTF-8 or UTF-16 is used in Forth.

Obviously, for better performance type should not be implemented over emit. But type should also check for uncompleted characters since s\" \xC3" type s\" \xA4" type should be equivalent to $C3 emit $A4 emit. So emit can be implemented over type without noticeable performance loss.

The word -TRAILING-GARBAGE can be used to separate the completed part from the last uncompleted xchar.
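The buffering idea could be sketched roughly as follows. All names here (WORK, #WORK, BUFFERED-TYPE, BUFFERED-EMIT) are hypothetical, and the sketch assumes 8-bit chars occupying one address unit each, UTF-8 xchars, and the xchar extension word -TRAILING-GARBAGE. There is no overflow check, and a trailing sequence that is invalid rather than merely incomplete stays held back until more data arrives, so this is an illustration rather than a complete implementation.

```forth
\ Sketch only: hypothetical names; assumes 1 char = 1 address unit and UTF-8.
256 constant /work
create work /work chars allot     \ held-back tail followed by new data
variable #work  0 #work !         \ number of chars currently in WORK

: buffered-type ( c-addr u -- )
  \ append the new data after any held-back chars (no overflow check)
  dup >r  work #work @ chars +  swap move  r> #work +!
  \ split off the part that ends with a completed xchar and output it
  work #work @ -trailing-garbage   ( work u2 )
  2dup type
  \ move the incomplete tail to the front and remember its length
  #work @ over - >r                ( work u2 ) ( R: rest )
  chars +  work r@ move
  r> #work ! ;

: buffered-emit ( char -- )
  pad c!  pad 1 buffered-type ;    \ EMIT layered over the buffered TYPE

$c3 buffered-emit   \ incomplete xchar: held back, nothing is typed yet
$a4 buffered-emit   \ completes the xchar: both chars are typed together
```

Layering BUFFERED-EMIT over BUFFERED-TYPE keeps a single place where incomplete xchars are detected, which matches the observation that EMIT can be implemented over TYPE.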

ruv

Certainly, this buffering before output can be also implemented on the JS side. It should ensure that the last uncompleted xchar is not passed to decode(), but it's buffered to be concatenated with the next part.

StephenPelc

I support the idea that EMIT and KEY use pchars. Note that pchars are already defined in the standard.

Many Forth systems support redirectable I/O. It is almost impossible to guarantee that all comms channels handle xchars. In particular, both TCP/IP and USB may have breaks in the middle of UTF-8 characters.

AntonErtl: New Version: [184] EMIT and non-ASCII values

Author:

Anton Ertl

Change Log:

2021-04-03 Original proposal

2022-09-15 Better wording (also includes systems with address units >8 bits)

Problem:

The first ideas for the xchar wordset had EMIT behave like (current) XEMIT. Then Stephen Pelc pointed out that EMIT is used in a number of programs for dealing with raw bytes, so we introduced XEMIT for dealing with extended characters. But the wording and stack effect of EMIT suggest that EMIT should deal with (possibly extended) characters rather than raw bytes. That reading is at odds with a number of implementations, and under it there would be hardly any reason to keep both EMIT and XEMIT.

Solution:

Define EMIT to deal with uninterpreted characters. Concerning systems with characters=address units larger than bytes, I would like to hear back from them if they need any more specific definition than what is proposed.

I leave a similar proposal for KEY to interested parties.

Typical use: (Optional)

$c3 emit $a4 emit \ outputs ä on a UTF-8 system

Proposal:

Change the definition of EMIT into:

EMIT ( char -- )

Send char to the user output device without interpreting it.

Add a reference to "18.6.1.2488.10 XEMIT" to the "See:" section.

Rationale:

EMIT supports low-level communication of arbitrary contents, not limited to specific encodings; it corresponds to TYPEing one char/byte. To print multi-byte extended characters, the straightforward way is to use TYPE or XEMIT, but you can also print the individual bytes with multiple EMITs.

Reference implementation:

create emit-buf 1 allot

: emit ( char -- )
  emit-buf c! emit-buf 1 type ;

Existing practice

Gforth, SwiftForth, and VFX implement EMIT as dealing with raw bytes (tested with the "typical use" above), but Peter Fälth's system implements EMIT as an alias of XEMIT, and iForth prints two funny characters. It is unclear if there are any existing programs affected by the proposed change.

Testing:

This cannot be tested from a standard program, because there is no way to inspect the output of EMIT.

AntonErtl: New Version: [184] EMIT and non-ASCII values

Author:

Anton Ertl

Change Log:

  • 2021-04-03 Original proposal
  • 2022-09-15 Better wording (also includes systems with address units >8 bits)
  • 2022-09-17 More explanation in the Rationale

Problem:

The first ideas for the xchar wordset had EMIT behave like (current) XEMIT. Then Stephen Pelc pointed out that EMIT is used in a number of programs for dealing with raw bytes, so we introduced XEMIT for dealing with extended characters. But the wording and stack effect of EMIT suggest that EMIT should deal with (possibly extended) characters rather than raw bytes. That reading is at odds with a number of implementations, and under it there would be hardly any reason to keep both EMIT and XEMIT.

Solution:

Define EMIT to deal with uninterpreted characters. Concerning systems with characters=address units larger than bytes, I would like to hear back from them if they need any more specific definition than what is proposed.

I leave a similar proposal for KEY to interested parties.

Typical use: (Optional)

$c3 emit $a4 emit \ outputs ä on a UTF-8 system

Proposal:

Change the definition of EMIT into:

EMIT ( char -- )

Send char to the user output device without interpreting it.

Add a reference to "18.6.1.2488.10 XEMIT" to the "See:" section.

Rationale:

Add the following Rationale (as A.6.1.1320):

EMIT supports low-level communication of arbitrary contents, not limited to specific encodings; it corresponds to TYPEing one char (i.e. addr 1 type). In Unicode terminology, EMIT does not send a code point (there is XEMIT for that), but a code unit. To print multi-char extended characters, the straightforward way is to use TYPE or XEMIT, but you can also print the individual chars with multiple EMITs.
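As an illustration of the code point versus code unit distinction, the following sketch encodes one xchar into PAD and then sends its code units one at a time. XEMIT-BY-UNITS is a hypothetical name, not a standard word, and the sketch assumes the xchar word set is present and that one char occupies one address unit.

```forth
\ Sketch only: XEMIT-BY-UNITS is a hypothetical name, not a standard word.
\ Assumes the xchar word set and that 1 char = 1 address unit.
: xemit-by-units ( xchar -- )
  pad xc!+            ( xc-addr2 )  \ encode the code point into PAD
  pad tuck -          ( pad u )     \ u = number of code units stored
  0 ?do               ( c-addr )
    dup c@ emit                     \ send one code unit, uninterpreted
    char+
  loop drop ;

$e4 xemit             \ code point U+00E4 sent in one call
$e4 xemit-by-units    \ the same code units, one EMIT each
```

On a UTF-8 system with the proposed EMIT, both lines should send the chars $c3 $a4, so the output device sees the same sequence either way.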

Add the following reference implementation as E.6.1.1320:

Reference implementation:

create emit-buf 1 allot

: emit ( char -- )
  emit-buf c! emit-buf 1 type ;

Existing practice

Gforth, SwiftForth, and VFX implement EMIT as dealing with raw bytes (tested with the "typical use" above), but Peter Fälth's system implements EMIT as an alias of XEMIT, and iForth prints two funny characters. It is unclear if there are any existing programs affected by the proposed change.

Testing:

This cannot be tested from a standard program, because there is no way to inspect the output of EMIT.

PeterFalth

" without interpreting it" will not be true on a Windows system. Windows works internally with UTF-16 so emit needs to buffer and translate to UTF-16 before sending the sequence to the screen. With the new "Windows Terminal" that will become the standard terminal for at least W11 this will change. The Windows Terminal has a VT-mode that makes it work like a UNIX terminal and with that UTF8 strings can be sent directly to the screen. Of course the translation is still there but hidden in the terminal.

If you really need to restrict EMIT, just write that its input must be within 0-255. Then you also need to specify what happens if someone sends, for example, $20ac to EMIT. Will it abort, just emit the low byte, or maybe write a Euro sign on the screen?

Peter Fälth

AntonErtl

The idea is that it works as shown in the reference implementation and as described in Section "Typical Use". Several people who implement Forth on Windows were present in the committee meeting, and the idea of EMIT as dealing with raw bytes comes from one of them, so I expect that there is some way to implement the proposed EMIT on Windows. It does not matter if Windows, when displaying on the screen, first waits until it has a Unicode code point, converts it into UTF-16, and then uses its UTF-16 subsystem for displaying that. What matters is that binary data (including data that is not a valid code point according to the used encoding) sent through EMIT and redirected to somewhere is left unscathed by the Forth system and the OS. For the code-point display we have XEMIT.

It seems to me that this intent was perceived correctly by you (so the specification expresses the intent), but you think that it cannot be implemented on Windows.

Concerning restricting EMIT to 0-255: Systems with characters (and address units) larger than bytes may want to EMIT these larger characters (or not; the implementors of such systems have to figure out what is most useful in their situation), and the present proposal does not want to eliminate this option.

As for dealing with non-char inputs: In Forth-2012 EMIT is specified as taking an x (a cell), but the behaviour is standard-specified only for specific values and implementation-defined for the others. The common practice among the systems that emit raw bytes (Gforth, SwiftForth, VFX) is to ignore the upper bits. So

$1c3 emit $ffa4 emit

also prints "ä". I lean toward specifying that, maybe like "Upper bits in x that do not fit in a char are ignored".
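A minimal sketch of the "ignore the upper bits" behaviour, assuming 8-bit chars. EMIT-LOW and EMIT-LOW-BUF are hypothetical names chosen so that the system's own EMIT is not redefined. The explicit mask only makes the intent visible: C! by itself transfers just the low-order bits that fit in a char, so the reference implementation already behaves this way.

```forth
\ Sketch only: EMIT-LOW and EMIT-LOW-BUF are hypothetical names; 8-bit chars assumed.
create emit-low-buf 1 chars allot

: emit-low ( x -- )
  $ff and                 \ discard everything above the low 8 bits
  emit-low-buf c!         \ C! would also drop bits that do not fit in a char
  emit-low-buf 1 type ;   \ send the single char without interpreting it

$1c3 emit-low  $ffa4 emit-low   \ sends the same two chars as $c3 emit $a4 emit
```

On systems with larger chars the mask would have to change or be dropped, which is one reason to phrase the rule in terms of "bits that do not fit in a char" rather than a fixed 0-255 range.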

UlrichHoffmann

The committee decided to put this proposal in formal state. The author decides when to put it into community vote.
