UTF-8 - How many characters can be mapped with Unicode?


1 Answer

深蓝 · Answer 1 · 2021-10-16T23:10:07+0000

"I am asking for the count of all the possible valid combinations in Unicode with explanation."

1,111,998: 17 planes × 65,536 code points per plane − 2,048 surrogates − 66 noncharacters.
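The arithmetic can be checked with a short Python sketch. The variable names are illustrative only; the noncharacter ranges used here (U+FDD0..U+FDEF, plus the last two code points of every plane) are the ones the Unicode Standard defines.

```python
# Count the code points that are neither surrogates nor noncharacters.
PLANES = 17
PLANE_SIZE = 0x10000                       # 65,536 code points per plane

total = PLANES * PLANE_SIZE                # U+0000..U+10FFFF
surrogates = 0xDFFF - 0xD800 + 1           # U+D800..U+DFFF

# Noncharacters: U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each plane.
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * PLANES

valid = total - surrogates - noncharacters
print(f"{total:,} - {surrogates:,} - {noncharacters} = {valid:,}")
# 1,114,112 - 2,048 - 66 = 1,111,998
```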

Note that UTF-8 and UTF-32 could theoretically encode many more than 17 planes, but the range is restricted to what UTF-16 can represent: its surrogate pairs can address only 16 supplementary planes beyond the first one.
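To see where the UTF-16 ceiling comes from, here is a rough sketch of the surrogate-pair arithmetic. The helper name `to_surrogate_pair` is made up for this illustration and is not a standard library function.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits select the low surrogate
    return high, low

# 1,024 high surrogates x 1,024 low surrogates cover exactly 16 planes.
supplementary = 1024 * 1024
max_code_point = 0x10000 + supplementary - 1
print(hex(max_code_point))                 # 0x10ffff

# The largest code point maps to the last possible pair.
print([hex(unit) for unit in to_surrogate_pair(0x10FFFF)])   # ['0xdbff', '0xdfff']

# Python enforces the same upper bound on code points.
try:
    chr(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```

Because no surrogate pair exists beyond (DBFF, DFFF), nothing above U+10FFFF can be expressed in UTF-16, and the other encodings are capped to match.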

Of these, only 137,929 code points are actually assigned to characters as of Unicode 12.1.

"I also don't understand why continuation bytes have restrictions even though starting byte of that char clears how long it should be."

The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.

For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encodings of the digits 0 (byte 30) and 8 (byte 38). So if you have a string-searching function that is not designed for this encoding-specific quirk, a search for the digit 8 will find a false positive within the letter ß.

In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
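A minimal Python sketch of the difference, using the byte sequence 81 30 89 38 from the example and Python's built-in `gb18030` and `utf-8` codecs. The helpers `is_continuation` and `next_boundary` are hypothetical names introduced here to show what self-synchronization buys you; they assume the input is valid UTF-8.

```python
# The four-byte GB 18030 sequence from the example.
gb_bytes = bytes.fromhex("81308938")
letter = gb_bytes.decode("gb18030")

# A naive byte-level search finds an ASCII digit inside the GB 18030 letter...
print(b"8" in gb_bytes)                  # True: a false positive
# ...but not inside the UTF-8 encoding of the same letter.
print(b"8" in letter.encode("utf-8"))    # False

def is_continuation(byte: int) -> bool:
    """UTF-8 continuation (trail) bytes always have the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80

def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos at which a character starts (valid UTF-8 assumed)."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos

# 1-, 2-, 3- and 4-byte sequences: a (index 0), ß (1-2), € (3-5), 😀 (6-9).
text = "aß€😀".encode("utf-8")

# Landing in the middle of the 3-byte sequence, we skip to the next lead byte.
print(next_boundary(text, 4))            # 6
```

Since lead bytes and continuation bytes never share a value, a decoder that starts at an arbitrary offset needs to skip at most three bytes to find the next character boundary.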
