Token: the basic unit a tokenizer produces; depending on the scheme, a word, a subword, a punctuation mark, or a character.

Tokenizer: splits raw text into a sequence of tokens.

  • Rule-based
    • Whitespace tokenizer (split on spaces, tabs, newlines)
    • Punctuation tokenizer (split on both whitespace and punctuation)
    • Simple regex tokenizer, e.g. splitting on \W+
    • NLTK tokenizers, with word_tokenize as the standard entry point, based on Penn Treebank rules
    • Stanford Tokenizer (now part of Stanza)
    • Penn Treebank (PTB) tokenizer, which follows the conventions of the Penn Treebank corpus
    • spaCy tokenizer, rule-based with statistical enhancements
    • Domain-specific tokenizers, such as GATE for information extraction and NLTK's TweetTokenizer for Twitter text
  • Statistical/ML-based (subword)
    • WordPiece, a subword tokenizer used by BERT
    • Byte-Pair Encoding (BPE), used by the GPT family and RoBERTa; OpenAI's tiktoken is a fast byte-level BPE implementation
    • Unigram language model, used by XLNet and ALBERT (typically via the SentencePiece library)
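The rule-based approaches above can be contrasted in a few lines. A minimal sketch using only Python's re module (the sample sentence and patterns are illustrative, not from any particular library):

```python
import re

text = "Don't panic: tokenizers split text into tokens."

# Whitespace tokenizer: split on runs of whitespace only.
# Punctuation stays attached to words ("panic:", "tokens.").
whitespace_tokens = text.split()

# Punctuation-aware tokenizer: keep words and punctuation
# marks as separate tokens.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

# Simple \W+ regex tokenizer: split on runs of non-word
# characters, silently dropping punctuation and splitting
# contractions ("Don't" -> "Don", "t").
regex_tokens = [t for t in re.split(r"\W+", text) if t]

print(whitespace_tokens)
print(punct_tokens)
print(regex_tokens)
```

The output makes the trade-offs visible: whitespace splitting glues punctuation to words, \W+ destroys it entirely, and the punctuation-aware pattern keeps it as separate tokens, which is closer to what word_tokenize does.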
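WordPiece inference (as used by BERT) can be sketched as greedy longest-match-first segmentation against a fixed vocabulary, with "##" marking word-internal continuation pieces. A simplified sketch with a toy hand-picked vocabulary; it shows segmentation only, not the training procedure that learns the vocabulary:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

# Toy vocabulary (hypothetical, for illustration only).
vocab = {"un", "##aff", "##able", "##a", "##ff"}
print(wordpiece_tokenize("unaffable", vocab))
```

Greedy matching prefers "##aff" over the shorter "##a", so "unaffable" segments into un / ##aff / ##able; a word with no matching pieces falls back to [UNK].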
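BPE training is also short enough to sketch: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent one. A toy sketch on the small low/lower/newest/widest corpus commonly used to illustrate BPE; the naive string replace is for clarity and is not robust for multi-character symbol boundaries in general:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation.

    Naive str.replace; fine here because the toy symbols make the
    space-joined pair unambiguous inside each word.
    """
    merged = " ".join(pair)
    new_symbol = "".join(pair)
    return {word.replace(merged, new_symbol): freq for word, freq in vocab.items()}

# Words pre-split into characters, with an end-of-word marker </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(4):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
```

The learned merge list is the tokenizer: at inference time the same merges are replayed, in order, on any new word, so frequent strings like "est" become single tokens while rare words decompose into smaller pieces.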