18.6.1.2486.50 X-SIZE XCHAR
( xc-addr u1 -- u2 )
u2 is the number of pchars used to encode the first xchar stored in the string xc-addr u1. To calculate the size of the xchar, only the bytes inside the buffer may be accessed. An ambiguous condition exists if the xchar is incomplete or malformed.
Implementation:
: X-SIZE ( xc-addr u1 -- u2 )
   0= IF DROP 0 EXIT THEN              \ empty string: size is 0
   \ length of UTF-8 char starting at xc-addr (accesses only the first byte)
   C@
   DUP $80 U< IF DROP 1 EXIT THEN
   DUP $c0 U< IF -77 THROW THEN        \ a continuation byte cannot start an xchar
   DUP $e0 U< IF DROP 2 EXIT THEN
   DUP $f0 U< IF DROP 3 EXIT THEN
   DUP $f8 U< IF DROP 4 EXIT THEN
   DUP $fc U< IF DROP 5 EXIT THEN
   DUP $fe U< IF DROP 6 EXIT THEN
   -77 THROW ;                         \ $fe and $ff are never valid lead bytes
Contributions
alextangent [28] UTF-8 and Unicode codepoint maximum (Request for clarification, 2017-02-14 19:53:17)
U+10FFFF is the maximum Unicode code point, and anything above it is not valid UTF-8. The implementation of X-SIZE should therefore throw -77 (Malformed xchar) for any sequence beyond UTF-8 F4 8F BF BF.
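
One way the request above might be met is sketched below. This is an illustration, not the standard's reference implementation. X-SIZE-STRICT is a hypothetical name. The sketch assumes a UTF-8 xchar encoding and uses NIP and U> from the Core extension word set. It only enforces the U+10FFFF upper bound: lead bytes $F5 and above are rejected, and for lead byte $F4 the second byte is checked when it lies inside the buffer. Overlong encodings and other malformed continuation bytes are not checked.

```forth
\ Sketch only: X-SIZE-STRICT is a hypothetical name, not a standard word.
\ Behaves like the reference X-SIZE, but rejects encodings above U+10FFFF.
: X-SIZE-STRICT ( xc-addr u1 -- u2 )
   DUP 0= IF 2DROP 0 EXIT THEN              \ empty string: size is 0
   OVER C@                                  ( xc-addr u1 lead )
   DUP $F4 = IF                             \ $F4 starts U+100000..U+10FFFF
      DROP 1 U> IF                          \ is the second byte in the buffer?
         CHAR+ C@ $8F U> IF -77 THROW THEN  \ $F4 $90.. lies above U+10FFFF
      ELSE
         DROP                               \ incomplete xchar: report size only
      THEN
      4 EXIT
   THEN
   NIP NIP                                  ( lead )
   DUP $80 U< IF DROP 1 EXIT THEN
   DUP $C0 U< IF -77 THROW THEN             \ continuation byte cannot lead
   DUP $E0 U< IF DROP 2 EXIT THEN
   DUP $F0 U< IF DROP 3 EXIT THEN
   DUP $F4 U< IF DROP 4 EXIT THEN           \ $F0..$F3
   -77 THROW ;                              \ $F5..$FF: beyond U+10FFFF
```

Unlike the reference implementation, this variant may read the second byte, which the glossary text permits as long as the byte is inside the buffer xc-addr u1. Callers that want to recover from a malformed xchar can wrap the call in CATCH and test the result for -77.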