The `char` type we all know and love, with its assumption that a character will fit into a byte, is no longer adequate.

C offers `wchar_t` and a bunch of functions to help deal with non-ASCII encodings, but I always found them unnecessarily complex and confusing.

| Range (hex) | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|
| 0000 - 007F | 0xxxxxxx | | | |
| 0080 - 07FF | 110xxxxx | 10xxxxxx | | |
| 0800 - FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 10000 - 10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
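To make the table concrete, here is a minimal sketch of an encoder that follows it row by row. The name `u8encode` and its signature are assumptions made for illustration; it is not a standard library function. It also rejects the UTF-16 surrogate range, which has no valid UTF-8 encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): encodes the codepoint cp into buf,
   which must have room for at least 4 bytes. Returns the number of bytes
   written, or 0 if cp cannot be encoded in UTF-8. */
size_t u8encode(uint32_t cp, unsigned char *buf)
{
  if (cp <= 0x7F) {                      /* 0xxxxxxx */
    buf[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {                     /* 110xxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xC0 | (cp >> 6));
    buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogates */
    buf[0] = (unsigned char)(0xE0 | (cp >> 12));
    buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                              /* beyond the Unicode range */
}
```

For instance, U+00E9 (é) falls in the second row and comes out as the two bytes `C3 A9`, while U+20AC (€) falls in the third row and comes out as `E2 82 AC`.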
Codepoints from `U+D800` to `U+DFFF` (known as UTF-16 surrogates) are invalid and, hence, their encoding is invalid.

- a byte whose two most significant bits are `11` (i.e. if `(b & 0xC0) == 0xC0`) is the first byte of a multibyte encoding;
- a byte whose two most significant bits are `10` (i.e. if `(b & 0xC0) == 0x80`) is part of a multibyte encoding;
- no `NUL` character (`'\0'`) is introduced as a byproduct of the encoding, meaning that our convention that a string is 0-terminated is safe;
- no ASCII character (`','`, `';'`, `' '`, ...) is introduced as a byproduct of the encoding either.

- `strcpy()`, `strcmp()`, `strstr()`, `fgets()`, and any other function that relies on ASCII terminators (`\0`, `\n`, `\t`, ...) are completely unaffected.
- `strtok()`, `strspn()`, `strchr()` will work as long as their other argument is within the ASCII range.
- For `strlen()`, `strncpy()`, and other size-limited functions, sizes are counted in bytes: the `n` parameter expresses the size (in bytes) of the buffer the string is in, not the number of characters in the string.

```c
// Returns the number of characters in a UTF-8 encoded string.
// (Does not check for encoding validity)
int u8strlen(const char *s)
{
  int len = 0;
  while (*s) {
    // Count every byte except continuation bytes (10xxxxxx).
    if ((*s & 0xC0) != 0x80) len++;
    s++;
  }
  return len;
}
```
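The two bit tests and the claim about ASCII delimiters can be sketched as follows. The names `u8islead`, `u8iscont` and `split_csv` are hypothetical, introduced only for this example; the splitting itself is done by plain `strtok()`, with no UTF-8 awareness at all.

```c
#include <string.h>

/* Hypothetical helpers (illustration only). */

/* True if b is the first byte of a multibyte encoding (11xxxxxx). */
int u8islead(unsigned char b) { return (b & 0xC0) == 0xC0; }

/* True if b is a continuation byte of a multibyte encoding (10xxxxxx). */
int u8iscont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Splits a comma-separated UTF-8 line in place using plain strtok().
   Stores up to max field pointers in fields[] and returns how many were found.
   This is safe because the byte ',' never appears inside a multibyte encoding. */
int split_csv(char *line, char *fields[], int max)
{
  int n = 0;
  char *tok = strtok(line, ",");
  while (tok != NULL && n < max) {
    fields[n++] = tok;
    tok = strtok(NULL, ",");
  }
  return n;
}
```

Called on a writable buffer holding `"caf\xC3\xA9,na\xC3\xAFve"` (that is, "café,naïve"), `split_csv()` should yield the two fields intact. Note that `strlen()` on the first field reports 5, the number of bytes, while the character count is 4.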
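```c
#include <stddef.h>
#include <string.h>

// Like strncpy(), but avoids truncating a multibyte UTF-8 encoding at the end.
// As with strncpy(), dest is not NUL-terminated when src fills all n bytes
// and its last copied character is complete.
char *u8strncpy(char *dest, const char *src, size_t n)
{
  size_t k, i;
  if (n) {
    k = n - 1;
    dest[k] = 0;
    strncpy(dest, src, n);
    if (dest[k] & 0x80) { // Last byte was overwritten by part of a multibyte encoding
      // Step back over at most 3 continuation bytes to reach the first byte.
      for (i = k; (i > 0) && ((k - i) < 3) && ((dest[i] & 0xC0) == 0x80); i--) ;
      // k-i continuation bytes were copied: cut at the first byte
      // unless it announces exactly that many.
      switch (k - i) {
        case 0: dest[i] = '\0'; break;
        case 1: if ((dest[i] & 0xE0) != 0xC0) dest[i] = '\0'; break;
        case 2: if ((dest[i] & 0xF0) != 0xE0) dest[i] = '\0'; break;
        case 3: if ((dest[i] & 0xF8) != 0xF0) dest[i] = '\0'; break;
      }
    }
  }
  return dest;
}
```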
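The same truncation concern can also be approached from the other direction: rather than copying and then repairing the tail, compute in advance where the string can be cut. The sketch below uses a hypothetical `u8cutlen()` (not part of any library) and assumes the input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): returns the largest length,
   not exceeding max_bytes, at which s can be cut without splitting a
   multibyte encoding. Assumes s is valid UTF-8. */
size_t u8cutlen(const char *s, size_t max_bytes)
{
  size_t len = strlen(s);
  size_t cut;

  if (len <= max_bytes) return len;   /* Fits entirely: nothing to cut. */

  /* s[cut] is the first byte that would be dropped. If it is a
     continuation byte (10xxxxxx), the cut falls inside a character,
     so move back until it lands on the character's first byte. */
  cut = max_bytes;
  while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80) cut--;
  return cut;
}
```

With the string "a€b" (bytes `61 E2 82 AC 62`), a limit of 3 bytes gives a cut length of 1, because bytes 1 to 3 belong to the same character, while a limit of 4 bytes gives 4. The caller can then `memcpy()` that many bytes and append the terminator itself.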
When you're asked to deal with UTF-8 encoded strings in C, ask yourself what aspect of the encoding really impacts your work. You may discover that being UTF-8 encoded is immaterial for the work you have to do!