18 The optional Extended-Character word set
18.1 Introduction
This word set deals with variable-width character encodings. It also works with fixed-width encodings.
Since the standard specifies ASCII encoding for characters, only ASCII-compatible encodings may be used. Because ASCII compatibility has so many benefits, most encodings actually are ASCII compatible. The characters beyond the ASCII encoding are called "extended characters" (xchars).
All words dealing with strings shall handle xchars when the xchar word set is present. This includes dictionary definitions. White space parsing does not have to treat code points greater than $20 as white space.
18.2 Additional terms and notation
18.2.1 Definition of Terms
- code point:
- A member of an extended character set.
18.2.2 Parsed-text notation
Append table 18.1 to table 2.1.
See: 2.2.3 Parsed-text notation.
18.3 Additional usage requirements
18.3.1 Data types
Append table 18.2 to table 3.1.
Symbol | Data type | Size on stack |
pchar | primitive character | 1 cell |
xchar | extended character | 1 cell |
xc-addr | xchar-aligned address | 1 cell |
18.3.1.1 Extended Characters
An extended character (xchar) is the code point of a character within an extended character set; on the stack it is a subset of u. Extended characters are stored in memory encoded as one or more primitive characters (pchars).
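As an illustration of the relationship between an xchar on the stack and its pchars in memory, the following minimal sketch encodes one xchar into a buffer and decodes it again. The names xc-buffer-size, xc-buf, and xc-roundtrip are hypothetical, and the fallback size of 8 address units is an assumption made only for this example.

```forth
\ Sketch only: xc-buffer-size, xc-buf and xc-roundtrip are illustrative names.
\ The fallback of 8 address units is an assumption, used when the system
\ does not answer the XCHAR-MAXMEM environmental query.
: xc-buffer-size ( -- u )
   S" XCHAR-MAXMEM" ENVIRONMENT? 0= IF 8 THEN ;

CREATE xc-buf  xc-buffer-size ALLOT

: xc-roundtrip ( xchar1 -- xchar2 u )
   DUP XC-SIZE >R          \ u = number of pchars needed to encode xchar1
   xc-buf XC!+ DROP        \ encode xchar1 at xc-buf, discard the end address
   xc-buf XC@+ NIP         \ decode it again, discard the end address
   R> ;                    \ xchar2 equals xchar1

'A' xc-roundtrip . .       \ prints 1 65
```

An ASCII character always occupies one pchar in an ASCII-compatible encoding; for characters beyond ASCII the reported size depends on the encoding in use.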
18.3.2 Environmental queries
Append table 18.3 to table 3.4.
String | Value data type | Constant? | Meaning |
XCHAR-ENCODING | c-addr u | no | Returns a printable ASCII string that represents the encoding, using the preferred MIME name (if any) or the name in the IANA character-set register (RFC-1700), such as "ISO-LATIN-1" or "UTF-8", with the exception of "ASCII", where the alias "ASCII" is preferred. |
MAX-XCHAR | u | no | Maximal value for xchar |
XCHAR-MAXMEM | u | no | Maximal memory consumed by an xchar in address units |
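A program might inspect these queries along the following lines. This is a sketch rather than a required usage pattern; the word name .xchar-info is hypothetical, and a system is free to answer false for any of the strings.

```forth
\ Sketch only: .xchar-info is an illustrative name.
\ Each query is checked separately because ENVIRONMENT? may return false.
: .xchar-info ( -- )
   S" XCHAR-ENCODING" ENVIRONMENT? IF
      ." encoding: " TYPE CR
   ELSE
      ." XCHAR-ENCODING not available" CR
   THEN
   S" MAX-XCHAR" ENVIRONMENT? IF
      ." largest xchar: " U. CR
   THEN
   S" XCHAR-MAXMEM" ENVIRONMENT? IF
      ." maximal xchar size in address units: " U. CR
   THEN ;

.xchar-info
```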
18.3.3 Common encodings
Input and files are often encoded in ISO-Latin-1 or UTF-8. The encoding depends on settings of the computer system such as the LANG environment variable on Unix. You can use the system consistently only when you do not change the encoding, or only use the ASCII subset. The typical practice in environments requiring more than one encoding is that the base system is ASCII only, and the character set is then extended to specify the required encoding.
18.3.4 The Forth text interpreter
In section 3.4.1.3 Text interpreter input number conversion, <cnum> should be redefined to be:
<cnum> | the number is the value of <xchar> |
18.3.5 Input and Output
IO words such as KEY, EMIT, TYPE, READ-FILE, READ-LINE, WRITE-FILE, and WRITE-LINE operate on pchars. Therefore, it is possible that these words read or write incomplete xchars, which are completed in the next consecutive operation(s). The IO system shall combine these pchars into a complete xchar on output, or split an xchar into pchars on input, and shall not throw a "malformed xchar" exception when the combination of these pchars forms a valid xchar. -TRAILING-GARBAGE can be used to process an incomplete xchar at the end of such an IO operation. ACCEPT as input editor may be aware of xchars to provide comfort like backspace or cursor movement.
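One way to deal with a buffer that may end in the middle of an xchar is sketched below. The word name split-incomplete is hypothetical. The sketch assumes that -TRAILING-GARBAGE from the Extended-Character extensions, /STRING from the String word set, and PICK from the Core extensions are available.

```forth
\ Sketch only: split-incomplete is an illustrative name.
\ Splits a freshly read buffer into the part that ends on a complete
\ xchar (xc-addr u2) and the pchars of a trailing incomplete xchar
\ (c-addr3 u3), which may be empty.
: split-incomplete ( xc-addr u1 -- xc-addr u2 c-addr3 u3 )
   2DUP -TRAILING-GARBAGE      \ xc-addr u1 xc-addr u2
   2SWAP 2 PICK /STRING ;      \ xc-addr u2 c-addr3 u3
```

After a READ-FILE that filled the buffer with u1 pchars, a program could process xc-addr u2 and copy the leftover c-addr3 u3 to the start of the buffer before the next read. For a string of complete xchars, for example pure ASCII text, the leftover part has length zero.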
18.4 Additional documentation requirements
18.4.1 System documentation
18.4.1.1 Implementation-defined options
Unicode input and display pose a number of challenges, such as input method editors for different languages, left-to-right and right-to-left writing, and fonts that contain only a subset of Unicode glyphs; systems should therefore document their capabilities. File IO and in-memory string handling should work transparently with xchars.
18.4.1.2 Ambiguous conditions
- the data in memory does not encode a valid xchar (18.6.1.2486.50 X-SIZE);
- the xchar value is outside the range of allowed code points of the current character set;
- words improperly used outside 6.1.0490 <# and 6.1.0040 #> (18.6.2.2488.20 XHOLD).
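For contrast with the last condition, a well-formed use of XHOLD keeps it between <# and #>, as in the following sketch. The word names u>prefixed and u.prefixed are hypothetical.

```forth
\ Sketch only: u>prefixed and u.prefixed are illustrative names.
\ XHOLD is used strictly between <# and #>, so the ambiguous
\ condition listed above does not arise.
: u>prefixed ( u xchar -- c-addr u2 )
   >R  0 <# #S  R> XHOLD  #> ;   \ digits first, then the xchar in front

: u.prefixed ( u xchar -- )
   u>prefixed TYPE ;

42 '$' u.prefixed               \ prints $42
```

Any xchar that the current encoding can represent may take the place of the ASCII character used here; the pictured numeric output string is built from right to left, so the xchar ends up before the digits.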
18.4.1.3 Other system documentation
- no additional requirements.
18.4.2 Program documentation
- no additional requirements.
18.5 Compliance and labeling
18.5.1 Forth-2012 systems
The phrase "Providing the Extended-Character word set" shall be appended to the label of any Standard System that provides all of the Extended-Character word set.
The phrase "Providing name(s) from the Extended-Character Extensions word set" shall be appended to the label of any Standard System that provides portions of the Extended-Character Extensions word set.
The phrase "Providing the Extended-Character Extensions word set" shall be appended to the label of any Standard System that provides all of the Extended-Character and Extended-Character Extensions word sets.
18.5.2 Forth-2012 programs
The phrase "Requiring the Extended-Character word set" shall be appended to the label of Standard Programs that require the system to provide the Extended-Character word set.
The phrase "Requiring name(s) from the Extended-Character Extensions word set" shall be appended to the label of Standard Programs that require the system to provide portions of the Extended-Character Extensions word set.
The phrase "Requiring the Extended-Character Extensions word set" shall be appended to the label of Standard Programs that require the system to provide all of the Extended-Character and Extended-Character Extensions word sets.
18.6 Glossary
18.6.1 Extended-Character words
- 18.6.1.2486.50 X-SIZE
- 18.6.1.2487.10 XC!+
- 18.6.1.2487.15 XC!+?
- 18.6.1.2487.20 XC,
- 18.6.1.2487.25 XC-SIZE
- 18.6.1.2487.35 XC@+
- 18.6.1.2487.40 XCHAR+
- 18.6.1.2488.10 XEMIT
- 18.6.1.2488.30 XKEY
- 18.6.1.2488.35 XKEY?
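As a small illustration of how these words combine, the following sketch counts the xchars in a string by stepping through it with XCHAR+. The word name xc-count is hypothetical, and the sketch assumes the string holds only valid, complete xchars; otherwise an ambiguous condition exists.

```forth
\ Sketch only: xc-count is an illustrative name.
\ Assumes xc-addr u describes a string of valid, complete xchars.
: xc-count ( xc-addr u -- n )
   OVER + >R  0 SWAP             \ n=0 xc-addr   R: end address
   BEGIN  DUP R@ U<  WHILE
      XCHAR+  SWAP 1+ SWAP       \ step over one xchar, bump the count
   REPEAT
   DROP  R> DROP ;

S" Forth" xc-count .             \ prints 5
```

For text containing extended characters in a variable-width encoding, the result is smaller than the length of the string in pchars.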