XC-WIDTH

( xchar -- n )

n is the number of monospace ASCII characters that take the same space to display as the xchar; i.e., xchar width is always an integer multiple of the width of an ASCII char.

Implementation:

: wc, ( n low high -- ) 1+ , , , ;

HEX
CREATE wc-table \ derived from wcwidth source code, for UCS32
0 0300 0357 wc,     0 035D 036F wc,     0 0483 0486 wc,
0 0488 0489 wc,     0 0591 05A1 wc,     0 05A3 05B9 wc,
0 05BB 05BD wc,     0 05BF 05BF wc,     0 05C1 05C2 wc,
0 05C4 05C4 wc,     0 0600 0603 wc,     0 0610 0615 wc,
0 064B 0658 wc,     0 0670 0670 wc,     0 06D6 06E4 wc,
0 06E7 06E8 wc,     0 06EA 06ED wc,     0 070F 070F wc,
0 0711 0711 wc,     0 0730 074A wc,     0 07A6 07B0 wc,
0 0901 0902 wc,     0 093C 093C wc,     0 0941 0948 wc,
0 094D 094D wc,     0 0951 0954 wc,     0 0962 0963 wc,
0 0981 0981 wc,     0 09BC 09BC wc,     0 09C1 09C4 wc,
0 09CD 09CD wc,     0 09E2 09E3 wc,     0 0A01 0A02 wc,
0 0A3C 0A3C wc,     0 0A41 0A42 wc,     0 0A47 0A48 wc,
0 0A4B 0A4D wc,     0 0A70 0A71 wc,     0 0A81 0A82 wc,
0 0ABC 0ABC wc,     0 0AC1 0AC5 wc,     0 0AC7 0AC8 wc,
0 0ACD 0ACD wc,     0 0AE2 0AE3 wc,     0 0B01 0B01 wc,
0 0B3C 0B3C wc,     0 0B3F 0B3F wc,     0 0B41 0B43 wc,
0 0B4D 0B4D wc,     0 0B56 0B56 wc,     0 0B82 0B82 wc,
0 0BC0 0BC0 wc,     0 0BCD 0BCD wc,     0 0C3E 0C40 wc,
0 0C46 0C48 wc,     0 0C4A 0C4D wc,     0 0C55 0C56 wc,
0 0CBC 0CBC wc,     0 0CBF 0CBF wc,     0 0CC6 0CC6 wc,
0 0CCC 0CCD wc,     0 0D41 0D43 wc,     0 0D4D 0D4D wc,
0 0DCA 0DCA wc,     0 0DD2 0DD4 wc,     0 0DD6 0DD6 wc,
0 0E31 0E31 wc,     0 0E34 0E3A wc,     0 0E47 0E4E wc,
0 0EB1 0EB1 wc,     0 0EB4 0EB9 wc,     0 0EBB 0EBC wc,
0 0EC8 0ECD wc,     0 0F18 0F19 wc,     0 0F35 0F35 wc,
0 0F37 0F37 wc,     0 0F39 0F39 wc,     0 0F71 0F7E wc,
0 0F80 0F84 wc,     0 0F86 0F87 wc,     0 0F90 0F97 wc,
0 0F99 0FBC wc,     0 0FC6 0FC6 wc,     0 102D 1030 wc,
0 1032 1032 wc,     0 1036 1037 wc,     0 1039 1039 wc,
0 1058 1059 wc,     1 0000 10FF wc,     2 1100 115F wc,
0 1160 11FF wc,     0 1712 1714 wc,     0 1732 1734 wc,
0 1752 1753 wc,     0 1772 1773 wc,     0 17B4 17B5 wc,
0 17B7 17BD wc,     0 17C6 17C6 wc,     0 17C9 17D3 wc,
0 17DD 17DD wc,     0 180B 180D wc,     0 18A9 18A9 wc,
0 1920 1922 wc,     0 1927 1928 wc,     0 1932 1932 wc,
0 1939 193B wc,     0 200B 200F wc,     0 202A 202E wc,
0 2060 2063 wc,     0 206A 206F wc,     0 20D0 20EA wc,
2 2329 232A wc,     0 302A 302F wc,     2 2E80 303E wc,
0 3099 309A wc,     2 3040 A4CF wc,     2 AC00 D7A3 wc,
2 F900 FAFF wc,     0 FB1E FB1E wc,     0 FE00 FE0F wc,
0 FE20 FE23 wc,     2 FE30 FE6F wc,     0 FEFF FEFF wc,
2 FF00 FF60 wc,     2 FFE0 FFE6 wc,     0 FFF9 FFFB wc,
0 1D167 1D169 wc,     0 1D173 1D182 wc,     0 1D185 1D18B wc,
0 1D1AA 1D1AD wc,     2 20000 2FFFD wc,     2 30000 3FFFD wc,
0 E0001 E0001 wc,     0 E0020 E007F wc,     0 E0100 E01EF wc,
HERE wc-table - CONSTANT #wc-table
DECIMAL

\ inefficient table walk:

: XC-WIDTH ( xchar -- n )
   wc-table #wc-table OVER + SWAP ?DO
     DUP I 2@ WITHIN IF DROP I 2 CELLS + @ UNLOOP EXIT THEN
   3 CELLS +LOOP DROP 1 ;

Testing:

T{ $606D XC-WIDTH -> 2 }T
T{   $41 XC-WIDTH -> 1 }T
T{ $2060 XC-WIDTH -> 0 }T

Contributions

PeterFalth
Return value of XC-WIDTH for control characters
Request for clarification, 2021-04-30 19:05:32

I have been updating my XC-WIDTH function and need to understand what the width should be for control characters ( 1 to $1F and $80 to $9F ). Markus Kuhn's original wcwidth returns -1 for them. I have seen that the Julia language returns 0. The function above returns 1. My suggestion would be to return 0. This question also applies to X-WIDTH. I could provide an updated and more complete function than the one above.

ruv

> what the width should be for control characters ( 1 to $1F and $80 to $9F )

Whatever it should be, it should be specified — zero, one, or system dependent.

The space that a control character takes depends on the environment and context. E.g. Tab (0x9) can take from 1 to 8 em (depending on the position on the display, and the display properties). Bell (0x7) usually takes 0 em.
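To illustrate why the width of Tab depends on position, here is a minimal sketch. It assumes tab stops every 8 columns and a zero-based column number; TAB-STOP and TAB-WIDTH are hypothetical names, not standard words.

```forth
8 CONSTANT TAB-STOP   \ assumption: tab stops every 8 columns

\ col is the zero-based column of the cursor before the Tab;
\ n is the number of columns the Tab advances (1 to TAB-STOP).
: TAB-WIDTH ( col -- n )  TAB-STOP SWAP TAB-STOP MOD - ;
```

At column 0 the Tab takes a full 8 columns, while at column 7 it takes only 1, so no single return value from XC-WIDTH can be right for it.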

Obviously, a Forth system cannot return the actual width for control characters in all cases.

If a program knows the environment and wants to calculate the actual width of a string, it should implement its own version of X-WIDTH. And then it doesn't matter what XC-WIDTH returns for control characters. But what XC-WIDTH returns should be good enough as a fallback.

Perhaps a good enough approach is to assume that the control characters have no "control" function, are mapped to some real characters, and are displayed as ordinary characters. XC-WIDTH should then return 1 for them.


A side note: in XML, most control characters are illegal because their semantics are unclear in that context (see the useful comments at StackOverflow).

StephenPelc

An exhaustive definition of control characters for (say) UTF-8 will be a big job. See for example
https://www.unicode.org/versions/Unicode12.0.0/ch23.pdf

If you can't do a proper job, then I suggest that we need an undefined/unknown return value, for which the obvious value is -1. This will certainly work for embedded systems.

ruv

> I suggest that we need an undefined/unknown return value, for which the obvious value is -1.

It should be noted that the reference implementation for X-WIDTH is admissible if XC-WIDTH returns 1 or 0 for control codes, but it isn't admissible if XC-WIDTH returns -1 for control codes.

When I print a string with control codes in the console, all but a few of them take one em.

-1 is easily transformed to 1 via ABS.

But -1 can be an actual width for DEL (in the sense that it decreases the length of the string by one). Is it then acceptable as a stand-in for unknown values?

If a program handles some control characters, it knows these characters and will probably not apply XC-WIDTH to them at all. If the program gets a special "unknown" value from XC-WIDTH, what can it do with it?

So I'm not convinced that we really need a special code for "unknown". It is probably enough to return some fallback value that is not too bad.

PeterFalth

I have done some more research into what other languages do. All use -1 except Julia, which returns 0. Based on this I will implement -1 as the return value for the characters 01 to $1F and $80 to $9F. -1 will signify that a width cannot be determined. As a consequence, X-WIDTH will also return -1 if such a character is found in the string. The same will apply to the surrogate range $D800-$DFFF, since these codes can never appear in a valid string.
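A minimal sketch of the behaviour described in this post, layered on top of an existing XC-WIDTH. The names UNDETERMINED?, XC-WIDTH-CHECKED and X-WIDTH-CHECKED are hypothetical; this is not the lxf64 source, only an illustration that assumes XC-WIDTH and XC@+ from the XCHAR wordset are present.

```forth
\ True for the ranges named above: 1..$1F, $80..$9F, $D800..$DFFF.
: UNDETERMINED? ( xchar -- flag )
   DUP  1     $20   WITHIN >R
   DUP  $80   $A0   WITHIN >R
        $D800 $E000 WITHIN R> OR R> OR ;

\ -1 signals "width cannot be determined".
: XC-WIDTH-CHECKED ( xchar -- n | -1 )
   DUP UNDETERMINED? IF DROP -1 ELSE XC-WIDTH THEN ;

\ Sum the widths; give up with -1 at the first undetermined xchar.
: X-WIDTH-CHECKED ( xc-addr u -- n | -1 )
   OVER + >R 0 SWAP                 \ R: end-addr   S: sum addr
   BEGIN DUP R@ U< WHILE
     XC@+ XC-WIDTH-CHECKED          \ sum addr' w
     DUP 0< IF                      \ undetermined: abandon the sum
       NIP NIP R> DROP EXIT         \ leaves -1
     THEN
     ROT + SWAP                     \ sum' addr'
   REPEAT
   R> 2DROP ;
```

Note that a caller of X-WIDTH-CHECKED has to test for -1 before using the result as a count.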

ruv

> X-WIDTH will return -1 if such a character is found in the string.

Actually, the value -1 (or any other negative value) semantically violates the current specification, since a negative "number of monospace ASCII characters" is nonsense.

At the moment, a program may use code like ... X-WIDTH BLANK or ... XC-WIDTH BLANK. Such a program is standard-compliant, but it will fail on your system for some characters.

ruv

If XC-WIDTH (or X-WIDTH) may return -1 as a special value, then in most cases this word should be followed by IF, as in:

   ... XC-WIDTH DUP -1 = IF DROP ( workaround ) ... ELSE ( use the width ) ... THEN

Perhaps a better way (and closer to Forth, where "functions" can return several values) would be to introduce another word that also returns a flag.

Some possible variants:

  • XC-WIDTH? ( xchar -- u true | false ) — similar to ENVIRONMENT? ( c-addr u -- i*x true | false )
  • XC>WIDTH ( xchar -- u true | xchar false ) — similar to EKEY>CHAR ( x -- char true | x false ) and EKEY>FKEY ( x -- u true | x false )
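The two variants could be sketched as follows. This is only an illustration, assuming the underlying XC-WIDTH returns a negative value for "unknown" (as described for lxf64 in this thread); on a system where XC-WIDTH never returns a negative value the flag is simply always true. XC-COLUMNS is a hypothetical usage example.

```forth
\ Assumption: XC-WIDTH returns a negative value when the width is unknown.
: XC-WIDTH? ( xchar -- u true | false )
   XC-WIDTH DUP 0< IF DROP FALSE ELSE TRUE THEN ;

: XC>WIDTH ( xchar -- u true | xchar false )
   DUP XC-WIDTH DUP 0< IF DROP FALSE ELSE NIP TRUE THEN ;

\ Usage: fall back to one column when the width is unknown.
: XC-COLUMNS ( xchar -- u )  XC-WIDTH? 0= IF 1 THEN ;
```

With the flag-returning form, the caller chooses its own fallback and no in-band special value is needed.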

PeterFalth

There are probably very few occasions where these words are really needed. One example is a command-line editor. I recently ported the editor from lxf to lxf64 and needed the following redefinitions, because xc-width in lxf64 returns -1 for control chars:

\ clamp the "unknown" result (-1) to zero width
: xc-width0  ( xchar -- n )  xc-width 0 max ;

\ sum the clamped widths of all xchars in the string
: x-width0   ( addr u -- n )
    over + >r 0 swap
    begin dup r@ < while xc@+ xc-width0 rot + swap repeat
    r> 2drop ;

I implemented the XCHAR wordset when it was discussed on CLF. I do not remember any discussion of the description of XC-WIDTH beyond that it was a word that should be in the standard. Now that emojis are available, are easy to input, and take the space of 2 chars, we need to revisit these words.
