Proposal: EMIT and non-ASCII values

Informal

AntonErtl · EMIT and non-ASCII values · Proposal · 2021-04-03 15:34:40

Author:

Anton Ertl

Change Log:

2021-04-03 Original proposal

Problem:

The first ideas for the xchar wordset had EMIT behave like (current) XEMIT. Then Stephen Pelc pointed out that EMIT is used in a number of programs for dealing with raw bytes, so we introduced XEMIT for dealing with extended characters. But the wording and stack effect of EMIT suggests that EMIT should deal with (possibly extended) characters rather than raw bytes. This is at odds with a number of implementations, and there is hardly any reason to keep both EMIT and XEMIT.

Solution:

Define EMIT to deal with raw bytes.

I leave a corresponding proposal for KEY to interested parties.

Typical use: (Optional)

$c3 emit $a4 emit \ outputs ä on a UTF-8 system

Proposal:

Change the definition of EMIT to:

EMIT ( char -- )

Send char as raw byte to the user output device.

Rationale:

EMIT supports low-level communication of arbitrary contents, not limited to specific encodings; it corresponds to TYPEing one char/byte. To print multi-byte extended characters, the straightforward way is to use TYPE or XEMIT, but you can also print the individual bytes with multiple EMITs.
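To make the equivalence concrete, here is a small Python sketch (Python is used purely for illustration; the proposal itself concerns Forth) showing that emitting the two UTF-8 bytes of ä one at a time produces the same byte stream as TYPEing them together:

```python
# The UTF-8 encoding of "ä" (U+00E4) is the two-byte sequence C3 A4.
encoded = "ä".encode("utf-8")
assert list(encoded) == [0xC3, 0xA4]

# "Multiple EMITs": send each raw byte separately and collect the stream.
stream = bytearray()
for byte in (0xC3, 0xA4):
    stream.append(byte)          # what a raw-byte EMIT would send

# "TYPE": send the whole two-byte string at once.
assert bytes(stream) == encoded  # the output device sees identical bytes
print(bytes(stream).decode("utf-8"))  # → ä
```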

Reference implementation:

create emit-buf 1 allot

: emit ( char -- )
  emit-buf c! emit-buf 1 type ;

Existing practice

Gforth, SwiftForth, and VFX implement EMIT as dealing with raw bytes (tested with the "typical use" above), but Peter Fälth's system implements EMIT as an alias of XEMIT, and iForth prints two funny characters. It is unclear if there are any existing programs affected by the proposed change.

Testing:

This cannot be tested from a standard program, because there is no way to inspect the output of EMIT.

MitraArdron

I don't think this proposal works for extended characters. While the $c3 emit $a4 emit example works for ä, this explicitly doesn't work (if my understanding is correct) for Unicode code points that are multi-byte in UTF-8. You have to know, at the point of emitting, what the expected encoding is.

If we define it as UTF-8, then EMIT can know that the byte is part of a multi-byte character and hold it until it gets the next byte before passing it to the operating system. But at the moment I don't believe that Forth 2012 is defined as UTF-8, so a conformant system would have to emit that first byte (which I think will have its top bit set) as a character.
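The hold-back behavior described above can be sketched in Python with an incremental UTF-8 decoder (the names `make_emit` and `sink` are hypothetical, introduced only for this illustration):

```python
import codecs

def make_emit(sink):
    """Return an emit function that buffers bytes until they form a
    complete UTF-8 character, then passes that character to sink."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    def emit(byte):
        text = decoder.decode(bytes([byte]))  # "" while incomplete
        if text:
            sink(text)
    return emit

out = []
emit = make_emit(out.append)
emit(0xC3)            # first byte of ä: held back, nothing output yet
assert out == []
emit(0xA4)            # second byte completes the character
assert out == ["ä"]
```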

For webForth in C (on Arduino) I feed the characters to Serial.write, which (I think) treats them as UTF-8. For webForth in JavaScript I flip it around: the base primitive is TX!S, which puts out a string. TYPE calls this directly, and EMIT passes a one-character string; TX!S just passes it to the JavaScript side, which is string-oriented. I define the stream as UTF-8 encoded at initialization. This is also a lot faster than passing characters individually to a string-oriented system anyway.

I'm not suggesting that what I've done is the right solution, but I think any proposal should be reviewed by someone with a better understanding (than me, or the proposer of this) of how Unicode and UTF-8 work before changes are made.

ruv

the wording and stack effect of EMIT suggests that EMIT should deal with (possibly extended) characters rather than raw bytes.

Yes, but 18.3.5 says: "IO words such as KEY, EMIT, [...] operate on pchars. Therefore, it is possible that these words read or write incomplete xchars, which are completed in the next consecutive operation(s)".

So if a system implements the optional Extended-Character word set, then the argument of EMIT shall be a pchar. I.e., it suggests that in such a case EMIT cannot deal with extended characters (code points) that are not pchars. The problem is that if EMIT accepts code points that are not pchars, then it cannot handle several calls with pchars as a single xchar, and vice versa, since pchar is a subset of xchar.

EMIT ( char -- )
Send char as raw byte to the user output device.

I would suggest relying on primitive characters instead of bytes (since the Standard actually does not use the "byte" notion):

EMIT ( pchar -- )
Send the primitive character pchar to the user output device.

Perhaps the primitive character data type (pchar) should be included in Table 3.1, since it's used in 3.1.2.3 and in A.6.1.1750. OTOH it's not clear what the particular differences between the "primitive character" (pchar) and "character" (char) data types are.

Existing practice

I tested SP-Forth/4 in Windows (by setting UTF-8 code page in the console via chcp 65001 command), and in Linux. The test:

HEX  C3 EMIT A4 EMIT

outputs ä

In SP-Forth the word EMIT is implemented via TYPE (that is via WRITE-FILE).

In the test

HEX  C3 EMIT  KEY DROP  A4 EMIT

we can see that after the first emit nothing is shown, and after the second emit the character ä is shown.

ruv

but for webForth in Javascript I flip it around and the base primitive is TX!S which puts out a string - TYPE calls this directly, and EMIT passes a 1 character string, TX!S just passes it to the Javascript which is string oriented - I define the stream as UTF8 encoded at initialization.

Regardless of the encoding of the source code files, JavaScript uses UTF-16 encoding for strings. When you set UTF-8, that only affects the encoding of the source code file/stream.

So the possible options to support extended characters beyond pchar are:

  1. Define pchar size as 16 bits (an easy case), and use UTF-16. Also, use address units of 16 bits, or address units of 8 bits with environmental restrictions.
  2. Define pchar size as 8 bits and handle UTF-8 by yourself (to convert it into UTF-16).

Another option is to not support extended characters beyond pchar at all. In this case pchar size can be 8, 16, or 32 bits.
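One caveat with option 1, sketched here in Python for illustration: a 16-bit pchar with UTF-16 still leaves code points outside the Basic Multilingual Plane encoded as two-unit surrogate pairs, so "one pchar per character" only holds up to U+FFFF:

```python
# "ä" (U+00E4) fits in a single 16-bit UTF-16 code unit ...
assert len("ä".encode("utf-16-le")) // 2 == 1
# ... but "😀" (U+1F600) needs a surrogate pair: two 16-bit units.
assert len("😀".encode("utf-16-le")) // 2 == 2
# For comparison, the same characters in UTF-8 take 2 and 4 bytes.
assert len("ä".encode("utf-8")) == 2
assert len("😀".encode("utf-8")) == 4
```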

MitraArdron

It would seem like a mistake to me to use 16-bit strings in Forth. Pretty much everything else seems to use UTF-8, and it makes the transition from existing ASCII much easier, since in the default case the string is treated exactly as before.

As an example, for the JavaScript case in Node, I use process.stdout.setEncoding('utf8'); to tell it that the output is UTF-8, then TextDecoder().decode(this.buff8(byteStart, byteEnd - byteStart)) to turn the Forth string into a JavaScript string, then process.stdout.write(s) to output that string. This has the advantage of working just the same whether the string is old-style Forth (one byte per ASCII character) or UTF-8.

If we used 16-bit strings, we'd need 16-bit versions of all the string words (S", ABORT", etc.) rather than just fixing the output routines not to convert strings back to characters.

ruv

It seems that, to support characters beyond pchar in a JavaScript-based Forth implementation, we need to introduce our own buffer before output to ensure that only completed xchars are passed to JS. And that is regardless of whether UTF-8 or UTF-16 is used in Forth.

Obviously, for better performance, TYPE should not be implemented over EMIT. But TYPE should also check for incomplete characters, since s\" \xC3" type s\" \xA4" type should be equivalent to $C3 emit $A4 emit. So EMIT can be implemented over TYPE without noticeable performance loss.

The word -TRAILING-GARBAGE can be used to separate the completed part from a trailing incomplete xchar.
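That splitting step can be modeled in Python as follows (the function name is hypothetical, and this is only a sketch of what -TRAILING-GARBAGE does for a UTF-8 buffer, not the standard word itself):

```python
def split_trailing_incomplete(buf: bytes):
    """Split buf into its longest complete UTF-8 prefix and a trailing
    incomplete character sequence (a model of -TRAILING-GARBAGE)."""
    # An incomplete trailing UTF-8 character is at most 3 bytes long;
    # a 4th byte would complete even the longest encoding.
    for cut in range(len(buf), max(len(buf) - 3, 0) - 1, -1):
        try:
            buf[:cut].decode("utf-8")
            return buf[:cut], buf[cut:]
        except UnicodeDecodeError:
            continue
    return b"", buf

# The complete part can be flushed; the rest waits for more bytes.
assert split_trailing_incomplete(b"ab\xc3") == (b"ab", b"\xc3")
assert split_trailing_incomplete("ä".encode("utf-8")) == (b"\xc3\xa4", b"")
```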

ruv

Certainly, this buffering before output can also be implemented on the JS side. It should ensure that the last incomplete xchar is not passed to decode() but is buffered to be concatenated with the next part.

StephenPelc

I support the idea that EMIT and KEY use pchars. Note that pchars are already defined in the standard.

Many Forth systems support redirectable I/O. It is almost impossible to guarantee that all comms channels handle xchars. In particular, both TCP/IP and USB may have breaks in the middle of UTF-8 characters.
