It's not exactly that ncurses is broken. It's more that glibc is broken, or whatever implementation of libc you are using; I'm just assuming that it is glibc.
Unlike simple console output (e.g., printf), ncurses needs to know how wide every character is when it is printed, because it needs to maintain its own model of what the screen looks like and where the cursor is. Not all Unicode codepoints are 1 unit wide, even with a monospaced font: many codepoints are zero units wide (combining accents, for example), and quite a few are two units wide (Han ideographs) [Note 1].
It turns out that there is a standard C library function, wcwidth, which takes a wchar_t and returns 0, 1, or 2 (or theoretically any integer, but as far as I know those are the only implemented widths) if the character is "printable", and -1 if the character is invalid or a control character. The wide-character-enabled version of ncurses uses wcwidth to predict how far the cursor will move after the character is printed. If wcwidth returns the error indication, ncurses substitutes a space.
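To make that bookkeeping concrete, here is a minimal sketch of how a cursor model could be advanced from wcwidth results. It is not ncurses's actual code: the names cell_advance and predicted_column are hypothetical, and the sketch ignores line wrapping and the special handling that real control characters such as a newline receive.

```c
#define _XOPEN_SOURCE 600   /* expose wcwidth() */
#include <assert.h>
#include <wchar.h>

/* Columns the cursor model advances for one wcwidth() result.
 * Assumption taken from the text above: a negative result (invalid or
 * control character) is replaced by a space, which occupies one column. */
int cell_advance(int width)
{
    return width < 0 ? 1 : width;
}

/* Predicted cursor column after printing s, starting from column 0.
 * Hypothetical helper: no wrapping, no special-casing of '\n' etc. */
int predicted_column(const wchar_t *s)
{
    int col = 0;
    assert(s != NULL);
    for (; *s != L'\0'; ++s)
        col += cell_advance(wcwidth(*s));
    return col;
}
```

If the terminal actually draws a character that wcwidth reported as -1, the model and the screen disagree about what was printed, which is the kind of mismatch this answer is about.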
wcwidth reads the width from the WIDTH section of the locale's charmap, but that definition only provides the exceptions; any printable character without a defined width is assumed to have a width of 1. So wcwidth also needs to check whether the character is printable, which is defined in the LC_CTYPE locale specification. That's the same data which drives the iswprint library function.
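The lookup order just described can be sketched as follows. This is only a model of the rule as stated above, not glibc's implementation (which uses precomputed tables); struct width_exception and model_wcwidth are hypothetical names, and the printable flag stands in for what iswprint would report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the charmap WIDTH section. */
struct width_exception {
    unsigned long first, last;  /* inclusive codepoint range */
    int width;                  /* width for that range, e.g. 0 or 2 */
};

/* Model of the rule described above:
 *   not printable              -> -1
 *   listed in the exceptions   -> the listed width
 *   otherwise                  -> 1 (the default) */
int model_wcwidth(unsigned long c, int printable,
                  const struct width_exception *table, size_t n)
{
    assert(table != NULL || n == 0);
    if (!printable)
        return -1;
    for (size_t i = 0; i < n; ++i)
        if (c >= table[i].first && c <= table[i].last)
            return table[i].width;
    return 1;
}
```

With an illustrative table holding the combining diacritics (0300 through 036F, width 0) and the CJK unified ideographs (4E00 through 9FFF, width 2), a printable 2603 falls through to the default of 1, while any codepoint the ctype data does not mark as printable yields -1 regardless of the table.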
Unfortunately, there is no guarantee that the terminal emulator shares the same view of Unicode character data as the C library functions. And for characters whose actual display widths are different from the locale-configured width, ncurses will produce unexpected behaviour.
In this case, there's no problem with the width (the characters are all 1 unit wide, so the default is correct); the problem is that the characters actually exist in your console font and you want to use them, but they don't exist in glibc's character database, because that database is still based on Unicode 5.0. (In fact, that bug itself should be updated, because Unicode is now at 6.3, not 6.1.)
To help you see that, here's a tiny little program which dumps the configured ctype information for Unicode codepoints [Note 2]:
#define _XOPEN_SOURCE 600   /* expose wcwidth() */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wctype.h>
#include <wchar.h>

/* IS(alpha) calls iswalpha(c) and yields "alpha " or "" */
#define CONC_(x,y) x##y
#define IS(x) (CONC_(isw,x)(c)?#x" ":"")

int main(int argc, char** argv) {
  setlocale(LC_CTYPE,"");   /* use the locale from the environment */
  for (int i = 1; i < argc; ++i) {
    /* each argument is a codepoint written in hexadecimal */
    wint_t c = strtoul(argv[i], NULL, 16);
    printf("Code %04X: width %d %s%s%s%s%s%s%s%s%s%s%s%s\n", c, wcwidth(c),
           IS(alpha),IS(lower),IS(upper),IS(digit),IS(xdigit),IS(alnum),
           IS(punct),IS(graph),IS(blank),IS(space),IS(print),IS(cntrl));
  }
  return 0;
}
Compile it, and you can look at your character data. It probably looks like this:
$ gcc -std=c11 -Wall -o wcinfo wcinfo.c
$ ./wcinfo 2603 26c4 1f638
Code 2603: width 1 punct graph print
Code 26C4: width -1
Code 1F638: width -1
So, what to do? You could wait for the glibc database to get updated, but I suspect that's not going to happen anytime soon. So if you really want to use those characters, you'll need to modify your own locale definitions.
If you have the same glibc installation as I do (and the locale files haven't changed for a while, so you probably do), then you'll find your locale files in /usr/share/i18n/locales, and in the actual locale file, the LC_CTYPE section will include the directive copy "i18n", which means that the actual ctype configuration is in the file /usr/share/i18n/locales/i18n. You can then edit that file to make appropriate changes. (Make a backup copy before you change the file, of course. And you'll need to sudo your editor because the file is only writable by root.)
First find the line which starts graph, [Note 3] and then search forwards for U26 (line 716 in my configuration, fwiw). You'll find a line with an entry which looks like <U26A0>..<U26C3>;, which means that codepoints 26A0 through 26C3 are graphical (visible printing) characters. Expand that range as necessary. (I changed the 26C3 to 26C4 for a minimal test, but you might want to include more characters.) A few lines further down, you'll see the second plane graph ranges; add an appropriate entry. Again, being minimalist, I added a new line:
<U0001F638>;/
but you'll probably want to include a range. (The trailing / is the continuation marker, by the way.)
Next, go down a couple more lines, and you'll find the print section. Make exactly the same changes.
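If you want to add more than one or two characters, it may be less error-prone to generate the entries than to type them. The following is a small sketch of a formatter for the notation shown above; format_codepoint and format_range are hypothetical helper names, and the sketch assumes the new entry is not the last one in its list, so it always ends with the ; separator and the / continuation marker.

```c
#include <assert.h>
#include <stdio.h>

/* Writes one codepoint in locale-definition notation: <UXXXX> up to
 * FFFF, <UXXXXXXXX> above that. Returns what snprintf returns. */
int format_codepoint(char *buf, size_t size, unsigned long cp)
{
    assert(buf != NULL && cp <= 0x10FFFFUL);
    if (cp <= 0xFFFFUL)
        return snprintf(buf, size, "<U%04lX>", cp);
    return snprintf(buf, size, "<U%08lX>", cp);
}

/* Writes "first..last;/" (or "first;/" when first == last), i.e. one
 * continued line for the graph or print list. */
int format_range(char *buf, size_t size,
                 unsigned long first, unsigned long last)
{
    char a[16], b[16];
    assert(buf != NULL && first <= last);
    format_codepoint(a, sizeof a, first);
    format_codepoint(b, sizeof b, last);
    if (first == last)
        return snprintf(buf, size, "%s;/", a);
    return snprintf(buf, size, "%s..%s;/", a, b);
}
```

Calling format_range with 0x26A0 and 0x26C4 produces the expanded range from the minimal test, and calling it with 0x1F638 for both ends produces the single-codepoint line shown earlier.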
Then you can regenerate your locale information by running:
$ sudo locale-gen
And then you can test:
$ ./wcinfo 2603 26c4 1f638
Code 2603: width 1 punct graph print
Code 26C4: width 1 graph print
Code 1F638: width 1 graph print
Once you do that, your original ncurses program should produce the expected output.
By the way, you can use wide character strings with ncurses; you don't have to manually produce UTF-8 encodings:
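/* Assumed setup for a wide-character ncurses build; link with -lncursesw */
#define _XOPEN_SOURCE_EXTENDED 1
#include <locale.h>
#include <wchar.h>
#include <curses.h>

int
main (int argc, char *argv[])
{
  setlocale (LC_ALL, "");
  const wchar_t* wstr = L"\u2603\u26c4\U0001F638";
  initscr ();   /* initializes the library's own global stdscr */
  mvwaddwstr (stdscr, 0, 0, wstr);
  getch ();
  endwin ();
  return 0;
}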
Notes
1. For more information, see Wikipedia on halfwidth and fullwidth forms.
2. It's a quick-and-dirty no-error-checking program, but it's good enough for what we need here. For production purposes, one would want a few more lines of code :)
3. You might not need to fix the graph wctype; print might be sufficient. I didn't check. I did both because ncurses also sometimes needs to know whether characters are transparent, and it seemed safer to mark the character as visible, since it is.