Token: the basic unit a tokenizer produces; depending on the scheme, a word, a subword, a punctuation mark, or a character.

Tokenizer: splits raw text into a sequence of tokens.

  • Rule-based
    • Whitespace tokenizer (split on spaces, tabs, newlines)
    • Punctuation tokenizer (split on both whitespace and punctuation)
    • Simple regex tokenizer, e.g. splitting on \W+
    • NLTK tokenizers, with word_tokenize as the standard entry point, based on Penn Treebank rules
    • Stanford Tokenizer (now part of Stanza)
    • Penn Treebank (PTB) tokenizer, which follows the conventions of the Penn Treebank corpus
    • spaCy tokenizer, rule-based with statistical enhancements
    • Domain-specific tokenizers, such as GATE for information extraction and NLTK's TweetTokenizer for Twitter text
  • Statistical/ML-based (subword)
    • WordPiece, a subword tokenizer used by BERT
    • Byte-Pair Encoding (BPE), used by the GPT family and RoBERTa; OpenAI's tiktoken is a fast byte-level BPE implementation
    • Unigram language model, used by XLNet and ALBERT (typically via the SentencePiece library)
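The rule-based approaches above can be contrasted in a few lines. A minimal sketch using only Python's re module (the sample sentence and patterns are illustrative, not from any particular library):

```python
import re

text = "Don't panic: tokenizers split text into tokens."

# Whitespace tokenizer: split on runs of whitespace only.
# Punctuation stays attached to words ("panic:", "tokens.").
whitespace_tokens = text.split()

# Punctuation-aware tokenizer: keep words and punctuation
# marks as separate tokens.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

# Simple \W+ regex tokenizer: split on runs of non-word
# characters, silently dropping punctuation and splitting
# contractions ("Don't" -> "Don", "t").
regex_tokens = [t for t in re.split(r"\W+", text) if t]

print(whitespace_tokens)
print(punct_tokens)
print(regex_tokens)
```

The output makes the trade-offs visible: whitespace splitting glues punctuation to words, \W+ destroys it entirely, and the punctuation-aware pattern keeps it as separate tokens, which is closer to what word_tokenize does.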
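WordPiece inference (as used by BERT) can be sketched as greedy longest-match-first segmentation against a fixed vocabulary, with "##" marking word-internal continuation pieces. A simplified sketch with a toy hand-picked vocabulary; it shows segmentation only, not the training procedure that learns the vocabulary:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

# Toy vocabulary (hypothetical, for illustration only).
vocab = {"un", "##aff", "##able", "##a", "##ff"}
print(wordpiece_tokenize("unaffable", vocab))
```

Greedy matching prefers "##aff" over the shorter "##a", so "unaffable" segments into un / ##aff / ##able; a word with no matching pieces falls back to [UNK].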
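BPE training is also short enough to sketch: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent one. A toy sketch on the small low/lower/newest/widest corpus commonly used to illustrate BPE; the naive string replace is for clarity and is not robust for multi-character symbol boundaries in general:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation.

    Naive str.replace; fine here because the toy symbols make the
    space-joined pair unambiguous inside each word.
    """
    merged = " ".join(pair)
    new_symbol = "".join(pair)
    return {word.replace(merged, new_symbol): freq for word, freq in vocab.items()}

# Words pre-split into characters, with an end-of-word marker </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(4):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
```

The learned merge list is the tokenizer: at inference time the same merges are replayed, in order, on any new word, so frequent strings like "est" become single tokens while rare words decompose into smaller pieces.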