character encoding - Why is Unicode restricted to 0x10FFFF?


1 Answer

深蓝 · Answer 1 · 2021-10-17T03:06:51+0000

It's because of UTF-16. Characters outside the Basic Multilingual Plane (BMP) are represented in UTF-16 by a surrogate pair, in which the first code unit (CU) lies in the range 0xD800–0xDBFF and the second in the range 0xDC00–0xDFFF. Each CU carries 10 bits of the code point, giving 20 bits of data in total (0x100000 code points), which are split into 16 planes (16×2¹⁶ code points). The BMP itself contributes another 0x10000 code points (0–0xFFFF).

Therefore the total number of code points is 17×2¹⁶ = 0x100000 + 0x10000 = 0x110000, which allows for code points from 0 to 0x110000 - 1 = 0x10FFFF. Alternatively, the last representable code point can be calculated like this: code points in the BMP are in the range 0–0xFFFF, so the offset for characters encoded with a surrogate pair is 0xFFFF + 1 = 0x10000. The largest 20-bit value a surrogate pair can carry is 0xFFFFF, so the last code point a surrogate pair can represent is 0xFFFFF + 0x10000 = 0x10FFFF.
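The arithmetic above can be checked with a short Python sketch. The helper names `to_surrogate_pair` and `from_surrogate_pair` are made up for this illustration and are not part of any library; the code only mirrors the bit layout described in the answer.

```python
HIGH_START = 0xD800   # first high (leading) surrogate
LOW_START = 0xDC00    # first low (trailing) surrogate
BMP_SIZE = 0x10000    # code points 0..0xFFFF need no surrogate pair


def to_surrogate_pair(cp):
    """Split a code point above the BMP into (high, low) UTF-16 code units."""
    if not BMP_SIZE <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - BMP_SIZE  # 20-bit value in 0..0xFFFFF
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Recombine a surrogate pair into the code point it represents."""
    return BMP_SIZE + ((high - HIGH_START) << 10) + (low - LOW_START)


# The largest pair (0xDBFF, 0xDFFF) gives the largest representable code point.
max_cp = from_surrogate_pair(0xDBFF, 0xDFFF)
print(hex(max_cp))                                   # 0x10ffff
print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
print(17 * 2**16 == 0x110000)                        # True
```

Because every bit of both surrogate ranges is already used, there is no larger pair to hand out, which is why the limit sits exactly at 0x10FFFF.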

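The Unicode Character Encoding Stability Policies guarantee that this will not change. The surrogate range is fixed, so UTF-16 can never address more code points, and no code point above U+10FFFF will ever be assigned: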

The General_Category property value Surrogate (Cs) is immutable: the set of code points with that value will never change.

Historically, UTF-8 allowed code points up to U+7FFFFFFF using up to 6 bytes, and UTF-32 could store twice as many values. However, because of the limit in UTF-16, the Unicode committee decided that a UTF-8 sequence can never be longer than 4 bytes, which gives UTF-8 the same range as UTF-16:
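As a rough illustration of the original 6-byte design, the sketch below lists the largest code point each sequence length could carry (7, 11, 16, 21, 26 and 31 payload bits). The table `ORIGINAL_UTF8_LIMITS` and the function `original_utf8_length` are hypothetical names for this example only; they describe the historical scheme, not what current encoders accept.

```python
# Largest code point that fits in 1..6 bytes under the original UTF-8 design.
ORIGINAL_UTF8_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def original_utf8_length(cp):
    """Number of bytes the pre-2003 UTF-8 scheme would need for cp."""
    for length, limit in enumerate(ORIGINAL_UTF8_LIMITS, start=1):
        if cp <= limit:
            return length
    raise ValueError("value does not fit in 31 bits")


print(original_utf8_length(0x7FFFFFFF))  # 6
print(original_utf8_length(0x10FFFF))    # 4
# A 32-bit unit has one more bit than the 31-bit UTF-8 maximum.
print(2**32 == 2 * 2**31)                # True
```

Note that U+10FFFF already falls in the 4-byte bracket, so capping UTF-8 at 4 bytes loses nothing that UTF-16 could represent.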

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.

https://en.wikipedia.org/wiki/UTF-8#History
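The percentages in the quoted passage can be reproduced with a little arithmetic. This sketch assumes the usual UTF-8 brackets (three-byte sequences cover U+0800–U+FFFF, four-byte sequences originally covered U+10000–U+1FFFFF); the variable names are made up for the example.

```python
# Three-byte sequences: U+0800..U+FFFF, of which the surrogates are prohibited.
three_byte_total = 0xFFFF - 0x0800 + 1
surrogates = 0xDFFF - 0xD800 + 1

# Four-byte sequences: originally U+10000..U+1FFFFF, now cut off after U+10FFFF.
four_byte_total = 0x1FFFFF - 0x10000 + 1
four_byte_removed = 0x1FFFFF - 0x10FFFF

three_byte_pct = 100 * surrogates / three_byte_total
four_byte_pct = 100 * four_byte_removed / four_byte_total

print(f"{three_byte_pct:.2f}% of three-byte sequences removed")  # 3.23%
print(f"{four_byte_pct:.2f}% of four-byte sequences removed")    # 48.39%
```

The two fractions work out to exactly 1/31 and 15/31, which matches "more than 3%" and "more than 48%".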

The same restriction has been applied to UTF-32:

In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the UTF-16 encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32

https://en.wikipedia.org/wiki/UTF-32
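One way to see these restrictions in practice is to ask a conforming implementation. The sketch below uses the built-in codecs of CPython 3 and assumes their current behavior; the helper `is_rejected` is a made-up name for this example, and other languages or older Python versions may behave differently.

```python
def is_rejected(action):
    """Return True if calling action() raises ValueError or a subclass of it.

    UnicodeEncodeError and UnicodeDecodeError are both ValueError subclasses.
    """
    try:
        action()
    except ValueError:
        return True
    return False


# U+10FFFF is the last code point Python will create.
print(is_rejected(lambda: chr(0x10FFFF)))   # False
print(is_rejected(lambda: chr(0x110000)))   # True

# The 32-bit value 0x110000 is not valid UTF-32.
print(is_rejected(lambda: (0x110000).to_bytes(4, "big").decode("utf-32-be")))  # True

# F4 90 80 80 would be U+110000 in UTF-8 and is refused as well.
print(is_rejected(lambda: bytes.fromhex("f4908080").decode("utf-8")))  # True

# A lone surrogate cannot be encoded as UTF-32 either.
print(is_rejected(lambda: "\ud800".encode("utf-32-be")))  # True
```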

