X-SIZE

( xc-addr u1 -- u2 )

u2 is the number of pchars used to encode the first xchar stored in the string xc-addr u1. To calculate the size of the xchar, only the bytes inside the buffer may be accessed. An ambiguous condition exists if the xchar is incomplete or malformed.

Implementation:

: X-SIZE ( xc-addr u1 -- u2 )
   0= IF DROP 0 EXIT THEN
   \ length of the UTF-8 char starting at xc-addr (only the first byte is read)
   C@
   DUP $80 U< IF DROP 1 EXIT THEN
   DUP $c0 U< IF -77 THROW THEN
   DUP $e0 U< IF DROP 2 EXIT THEN
   DUP $f0 U< IF DROP 3 EXIT THEN
   DUP $f8 U< IF DROP 4 EXIT THEN
   DUP $fc U< IF DROP 5 EXIT THEN
   DUP $fe U< IF DROP 6 EXIT THEN
   -77 THROW ;
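
As a rough usage sketch, assuming the reference definition above is loaded on a system that accepts `$`-prefixed hex literals, the following shows the three ordinary outcomes and the malformed case. The buffer name `demo-buf` is hypothetical and only serves the example.

```forth
\ demo-buf is an illustrative name, not part of the standard.
CREATE demo-buf  $E2 C, $82 C, $AC C,  $41 C,   \ U+20AC in 3 bytes, then "A"

demo-buf 4 X-SIZE .        \ lead byte $E2: prints 3
demo-buf 3 + 1 X-SIZE .    \ ASCII byte $41: prints 1
demo-buf 0 X-SIZE .        \ empty string: prints 0

\ Starting on a continuation byte ($82) is malformed; with the reference
\ definition this throws -77. CATCH restores the stack depth, so the two
\ (now undefined) input cells are dropped with NIP NIP.
demo-buf 1 + 1 ' X-SIZE CATCH NIP NIP .   \ prints -77
```

Note that the reference word inspects only the lead byte, so it reports 3 for `$E2` even when fewer than three bytes remain in the buffer; that incomplete case is the ambiguous condition named in the definition.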

Contributions

alextangent: UTF-8 and Unicode codepoint maximum (Request for clarification, 2017-02-14 19:53:17)

U+10FFFF is the maximum Unicode codepoint, and anything above this is not valid UTF-8. The implementation of X-SIZE should therefore throw a -77 (Malformed xchar) error for anything beyond the UTF-8 sequence F4 8F BF BF.
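
One way the requested check might look is sketched below. `STRICT-X-SIZE` is a hypothetical name, not a standard word, and the sketch assumes one pchar per address unit (UTF-8) and an existing `X-SIZE` for the remaining cases. It rejects lead bytes above `$F4`, and for a `$F4` lead it rejects a second byte above `$8F`, reading that byte only when it lies inside the buffer.

```forth
\ Sketch only: STRICT-X-SIZE is an illustrative name, not part of the standard.
: STRICT-X-SIZE ( xc-addr u1 -- u2 )
   DUP 0= IF 2DROP 0 EXIT THEN       \ empty string: size 0
   OVER C@                           ( xc-addr u1 lead )
   DUP $F4 U> IF -77 THROW THEN      \ $F5..$FF could only encode above U+10FFFF
   $F4 = IF                          \ $F4 lead: second byte must be <= $8F
      DUP 1 U> IF                    \ inspect it only if it is inside the buffer
         OVER CHAR+ C@ $8F U> IF -77 THROW THEN
      THEN
   THEN
   X-SIZE ;                          \ all other cases as in the existing word
```

The sketch does not attempt full validation: overlong forms, surrogates, and bad continuation bytes are left to whatever the underlying `X-SIZE` does.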

BerndPaysan (2017-02-15 01:27:46)

Yes, though that's an artefact of UTF-16, and it might be possible that future Unicode standards lift that limitation (when they run out of code points... and deprecate UTF-16 or provide a means to expand its code point range). With the current Unicode standard, a -77 throw for code points above $10FFFF is a correct and high quality implementation.

Note that the Postel principle suggests that you accept slightly wrong data, but you shall not produce it, so it's more important to throw on XC!+ and XEMIT.
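
A minimal sketch of checking on the producing side, assuming the standard `XEMIT ( xchar -- )` and `XC!+ ( xchar xc-addr1 -- xc-addr2 )` are available. The names `CHECK-XCHAR`, `CHECKED-XEMIT` and `CHECKED-XC!+` are hypothetical wrappers for illustration; a real system would more likely build the test into the words themselves.

```forth
\ All three names below are illustrative, not standard words.
: CHECK-XCHAR ( xchar -- xchar )
   DUP $10FFFF U> IF -77 THROW THEN ;   \ above the current Unicode maximum

: CHECKED-XEMIT ( xchar -- )
   CHECK-XCHAR XEMIT ;                  \ refuse to display an out-of-range xchar

: CHECKED-XC!+ ( xchar xc-addr1 -- xc-addr2 )
   SWAP CHECK-XCHAR SWAP XC!+ ;         \ refuse to store an out-of-range xchar
```

Only the upper bound discussed here is tested; other restrictions, such as the surrogate range, are outside this sketch.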
