35
loading...
This website collects cookies to deliver better user experience
utf8
and utf8mb4
and why mixed collations are bad for your database performance, read on!A character set is a set of characters [and its mapping to binary] while a collation is the rules for comparing and sorting a particular character set.
– MariaDB: Character Set and Collation Overview
T
(LATIN CAPITAL LETTER T), in binary looks like 1010100
. The computer reads this binary sequence and knows that this is 84 in decimal. Now it looks up which character maps to this number. Using the UTF-8 character set, the number 84 equals T
.| |
| | |
| | | |
___ _%%%_
\,-' '_| \___/ hjm
/""----'
utf8
has been a crippled and proprietary variation as it neither conforms to the old nor the new definition (RFC 3629) of UTF-8. The misleading name still causes issues today.This is probably one of most expensive single char commits in world...
– morphles, Nov 25, 2021
utf8
character set only allows 3 bytes, you can't store some special characters in a database that utilizes this character set. Unfortunately, no one knows who made the change as all names were lost when moving the repository from BitKeeper to GitHub.utf8mb4
charset in version 5.5.3. utf8mb4
fully implements the current standard. Now utf8
is an alias for utf8mb3
and will be switched to utf8mb4
.MariaDB [(none)]> SHOW COLLATION;
+------------------------------+----------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+------------------------------+----------+------+---------+----------+---------+
| utf8mb4_general_ci | utf8mb4 | 45 | Yes | Yes | 1 |
| utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 |
| utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 |
+------------------------------+----------+------+---------+----------+---------+
Suffix | Meaning |
---|---|
ci |
case-insensitive |
cs |
case-sensitive |
bin
stands for binary
and sorts data by its binary notation and does not consider any language-specific rules.general
honors some rules but uses a simplified algorithm favoring speed over accuracy.unicode
or its versioned variants like unicode_520
use the official UCS (Universal Coded Character Set) algorithms. Unicode collations provide the most accurate sorting.utf8mb4_general_ci
does not knwo how to sort s
and the German character ß
("sharp S") in contrast to utf8mb4_unicode_520_ci
which sorts ß
just fine (ß
comes after s
):MariaDB [(none)]> SELECT 's' < 'ß' COLLATE utf8mb4_general_ci;
+---------------------------------------+
| 's' < 'ß' COLLATE utf8mb4_general_ci |
+---------------------------------------+
| 0 |
+---------------------------------------+
1 row in set (0.000 sec)
MariaDB [(none)]> SELECT 's' > 'ß' COLLATE utf8mb4_general_ci;
+---------------------------------------+
| 's' > 'ß' COLLATE utf8mb4_general_ci |
+---------------------------------------+
| 0 |
+---------------------------------------+
1 row in set (0.000 sec)
MariaDB [(none)]> SELECT 's' < 'ß' COLLATE utf8mb4_unicode_520_ci;
+-------------------------------------------+
| 's' < 'ß' COLLATE utf8mb4_unicode_520_ci |
+-------------------------------------------+
| 1 |
+-------------------------------------------+
1 row in set (0.001 sec)
MariaDB [(none)]> SELECT 's' > 'ß' COLLATE utf8mb4_unicode_520_ci;
+-------------------------------------------+
| 's' > 'ß' COLLATE utf8mb4_unicode_520_ci |
+-------------------------------------------+
| 0 |
+-------------------------------------------+
1 row in set (0.000 sec)
utf8mb4_unicode_ci
is based on UCS 4.0, which is very old. Even the newer utf8mb4_unicode_520_ci
(UCS 5.2.0) is more than ten years old. But newer collations are already discussed, and we may see collations based on UCS 14 in version 10.8 of MariaDB. MySQL already implemented a few collations based on more recent UCS versions.utf8mb4_german2_ci
, but I have never used them personally. I recommend sticking to the unicode
versions as they fit most use cases.unicode
variants should be slower than the general
variant due to the more complex algorithms, but this was years ago when computers were much weaker than now. With modern hardware, the speed gain should be hardly notable.utf8
and utf8mb4
and what collation to use. There is no reason to use utf8/utf8mb3
, and I believe everyone who chose utf8
expected to get the real UTF-8 (utf8mb4
), not a crippled version. So to use it, you have to specify utf8mb4
explicitly.utf8mb4_general_ci
with modern hardware.