Unicode

Unicode | Wikipedia
Unicode assigns each character a number, and each such “number” is called a code point!
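A quick sketch of what a code point looks like in practice, assuming Python 3. The helper name `code_point_label` is made up for this note; `ord` and `chr` are the built-ins that convert between a character and its code point.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the code point as an int; 04X pads to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


for ch in ["A", "é", "€", "😀"]:
    # One code point may take several bytes once encoded as UTF-8.
    print(ch, code_point_label(ch), len(ch.encode("utf-8")), "byte(s) in UTF-8")
```

The code point is the abstract number; how many bytes it takes depends on the encoding chosen.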
Random bytes are very unlikely to form a valid UTF-8 byte sequence, so if non-ASCII text decodes as UTF-8 without errors, it’s probably UTF-8.
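A minimal sketch of that heuristic, assuming Python 3. The function name `looks_like_utf8` and the sample strings are made up for illustration; this is a plausibility check, not a full encoding detector.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "naïve café 日本語".encode("utf-8")
latin1_bytes = "naïve café".encode("latin-1")  # same text, different encoding

print(looks_like_utf8(utf8_bytes))    # True
print(looks_like_utf8(latin1_bytes))  # False: 0xEF is not followed by continuation bytes
```

Pure ASCII also passes the check, since ASCII is a subset of UTF-8, which is why the heuristic only says something useful for non-ASCII text.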
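BOM = byte-order mark: the U+FEFF character, placed at the beginning of the encoded bytes to signal the endianness. This works because the byte-swapped value, U+FFFE, is not a valid character, so a decoder that reads FFFE knows the byte order is reversed. The use of a BOM is discouraged for UTF-8, which has no byte-order ambiguity.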
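A small sketch of how the BOM behaves, assuming Python 3 and its standard `codecs` module. The variable names are made up for this note; the codec names `utf-16`, `utf-16-le`, `utf-16-be`, and `utf-8-sig` are the standard ones.

```python
import codecs

text = "hi"

# Build both byte orders explicitly so the demo does not depend on the host CPU.
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # starts with FF FE
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")     # starts with FE FF

# The generic "utf-16" decoder reads the BOM to pick the byte order, then drops it.
print(little.decode("utf-16"), big.decode("utf-16"))

# UTF-8 with a BOM: the plain "utf-8" codec keeps U+FEFF as a character,
# while "utf-8-sig" strips it.
with_bom = text.encode("utf-8-sig")
print(repr(with_bom.decode("utf-8")))      # '\ufeffhi'
print(repr(with_bom.decode("utf-8-sig")))  # 'hi'
```

The stray `\ufeff` left behind by the plain `utf-8` codec is one practical reason a BOM in UTF-8 tends to cause trouble.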
A “Unicode Sandwich”: decode bytes to text as early as possible, and encode text back to bytes as late as possible.
Always be specific about the encoding instead of relying on a platform default!
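The sandwich can be sketched like this, assuming Python 3. The names `shout`, `process_file`, and `demo` are made up for illustration; the point is only that bytes appear at the edges and `str` in the middle, with the encoding always named explicitly.

```python
import tempfile
from pathlib import Path


def shout(text: str) -> str:
    """The 'filling': pure text processing, no bytes involved."""
    return text.upper()


def process_file(src: Path, dst: Path, encoding: str = "utf-8") -> None:
    text = src.read_bytes().decode(encoding)   # bottom slice: decode early
    result = shout(text)                       # filling: work on str only
    dst.write_bytes(result.encode(encoding))   # top slice: encode late


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "in.txt"
        dst = Path(tmp) / "out.txt"
        src.write_bytes("café".encode("utf-8"))
        process_file(src, dst)
        # The encoding is passed explicitly rather than left to the locale.
        return dst.read_text(encoding="utf-8")


print(demo())  # CAFÉ
```

Passing `encoding=` everywhere is what keeps the result the same across machines with different default locales.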
Normalization
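- NFC and NFD: NFC composes characters into precomposed forms where possible, which generally gives the shortest equivalent string; NFD does the opposite and decomposes them into base characters plus combining marks. NFC is recommended by the W3C.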
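- NFKC and NFKD: K stands for compatibility, where compatibility characters (ligatures, superscripts and the like) are replaced by their preferred “compatibility decomposition”. These two forms distort information, so the original text cannot always be recovered.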
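The four forms can be compared with a short sketch, assuming Python 3 and the standard `unicodedata` module. The sample strings are picked for illustration only.

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# The two strings render the same but compare unequal until normalized.
print(composed == decomposed)  # False

nfc = unicodedata.normalize("NFC", decomposed)  # composes: 1 code point
nfd = unicodedata.normalize("NFD", composed)    # decomposes: 2 code points
print(len(nfc), len(nfd))  # 1 2

# Compatibility forms also rewrite characters like the "fi" ligature (U+FB01).
ligature = "\ufb01"
print(unicodedata.normalize("NFC", ligature) == ligature)  # True: NFC leaves it alone
print(unicodedata.normalize("NFKC", ligature))             # fi (two plain letters)
print(unicodedata.normalize("NFKC", "x\u00b2"))            # x2 (the superscript is lost)
```

The last two lines show the lossy part: after NFKC there is no way to tell that the text once contained a ligature or a superscript.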
