Unicode

  • Unicode | Wikipedia
  • Each character is assigned a “number”, called a code point (e.g. U+00E9 for é).
  • Random bytes are very unlikely to form a valid UTF-8 byte sequence, so if non-ASCII text decodes as UTF-8 without errors, it is probably UTF-8.
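  • BOM = byte-order mark: the U+FEFF character, placed at the beginning of the encoded bytes to signal the endianness. This works because the byte-swapped value, U+FFFE, is a noncharacter that should not appear in interchanged text. The use of a BOM is discouraged for UTF-8, which has no byte-order ambiguity.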
  • A “Unicode Sandwich”: decode as early as possible, encode as late as possible
  • Always be specific about the encoding!
  • Normalization
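    • NFC and NFD: NFC composes characters into precomposed forms where possible, which usually gives the shorter string; NFD does the opposite and decomposes them into base characters plus combining marks. NFC is recommended by W3C.
    • NFKC and NFKD: K stands for compatibility. On top of the canonical mapping, compatibility characters (ligatures, superscripts, full-width forms) are replaced by their “compatibility decomposition”. Both forms are lossy: the original characters cannot be recovered.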
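
The decoding, BOM and “Unicode Sandwich” notes above can be tried out with a short Python sketch. The sample string and the helper name `looks_like_utf8` are made up for illustration; only standard-library codecs are used.

```python
import codecs

text = "naïve café"  # sample text, chosen only for illustration

# Always name the encoding explicitly instead of relying on a default.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")


def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Non-ASCII text in another encoding rarely happens to be valid UTF-8.
utf8_ok = looks_like_utf8(utf8_bytes)
latin1_ok = looks_like_utf8(latin1_bytes)

# With UTF-16, the BOM tells the decoder which byte order was used.
utf16_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf16_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
decoded_le = utf16_le.decode("utf-16")  # BOM is consumed by the codec
decoded_be = utf16_be.decode("utf-16")

# A UTF-8 BOM is discouraged; "utf-8-sig" strips it if it is present,
# while plain "utf-8" keeps it as a leading U+FEFF character.
with_bom = codecs.BOM_UTF8 + utf8_bytes
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
```

In a “sandwich” layout, the `decode` calls belong at the input boundary and the `encode` calls at the output boundary, with only `str` values in between.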
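
The four normalization forms above can be compared with Python's standard `unicodedata` module. This is a minimal sketch; the sample characters (é and the ﬁ ligature) are chosen only for illustration.

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# The two strings render the same but compare unequal until normalized.
equal_before = composed == decomposed

nfc = unicodedata.normalize("NFC", decomposed)  # composes to one code point
nfd = unicodedata.normalize("NFD", composed)    # decomposes to two code points

# Compatibility forms also replace characters such as ligatures.
ligature = "\ufb01"  # the single-character "fi" ligature
nfc_ligature = unicodedata.normalize("NFC", ligature)    # left unchanged
nfkc_ligature = unicodedata.normalize("NFKC", ligature)  # becomes plain "fi"
```

After NFKC there is no way to tell that the text originally contained a ligature, which is the information loss mentioned above.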