I'm assuming you want to know about CPython, the standard implementation. Python 2 and Python 3.0-3.2 use either UCS2* or UCS4 for Unicode characters, meaning it'll either use 2 bytes or 4 bytes for each character. Which one is picked is a compile-time option.
u2049
is then represented as either x49x20
or x20x49
or x49x20x00x00
or x00x00x20x49
depending on the native byte order of your system and if UCS2 or UCS4 was picked. ASCII characters in a unicode string still use 2 or 4 bytes per character too.
Python 3.3 switched to a new internal representation, using the most compact form needed to represent all characters in a string. Either 1 byte, 2 bytes or 4 bytes are picked. ASCII and Latin-1 text uses just 1 byte per character, the rest of the BMP characters require 2 bytes and after that 4 bytes is used.
See PEP-393: Flexible String Representation for the full low-down on these representations.
* Technically speaking the UCS-2 build uses UTF-16, as non-BMP characters use UTF-16 surrogates to encode to 4 bytes (2 UTF-16 characters) each. However, Python documentation still refers to this as UCS2.
This does lead to unexpected behaviour such as the len()
on non-BMP unicode strings being longer than the number of characters contained.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…