Token
Tokenizer
- Rule based
- Whitespace tokenizer (split on spaces, tabs, etc.)
- Punctuation tokenizer (split on both whitespace and punctuation)
- Simple regex tokenizer, e.g. splitting on \W+
- NLTK tokenizers, with word_tokenize being the standard, based on Penn Treebank rules
- Stanford Tokenizer (now part of Stanza)
- Penn Treebank (PTB) tokenizer, follows the conventions of the Penn Treebank corpus
- spaCy tokenizer, rule-based with statistical enhancements
- Some domain-specific tokenizers, such as GATE for information extraction and the Twitter tokenizer (TweetTokenizer) in NLTK
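
The whitespace and \W+ regex tokenizers above can be sketched in a few lines with only Python's standard library (the sample sentence is made up for illustration):

```python
import re

text = "Don't panic -- it's just tokenization!"

# Whitespace tokenizer: split on runs of whitespace.
# Punctuation stays attached to adjacent words ("tokenization!").
ws_tokens = text.split()

# Simple regex tokenizer: split on runs of non-word characters (\W+).
# This separates punctuation but discards it, and also splits
# contractions like "Don't" into "Don" / "t".
regex_tokens = [t for t in re.split(r"\W+", text) if t]

print(ws_tokens)
print(regex_tokens)
```

The contrast shows why rule-based tokenizers quickly accumulate special cases (contractions, hyphens, URLs), which is what NLTK's word_tokenize and the spaCy tokenizer handle with richer rules.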
- Statistical/ML based
- WordPiece, a subword tokenizer, used by BERT
- Byte-Pair Encoding (BPE), used by GPT models and RoBERTa; tiktoken is an implementation
- Unigram, used by XLNet and ALBERT
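
To make the BPE idea concrete, here is a minimal toy sketch of the merge-learning loop (not how tiktoken or Hugging Face implement it in practice; the corpus and merge count are made up). Each word starts as a sequence of characters, and the most frequent adjacent symbol pair is repeatedly merged into a new subword symbol:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE: learn merge rules from a tiny word list."""
    # Represent each word as a tuple of symbols, initially characters.
    vocab = Counter()
    for w in words:
        vocab[tuple(w)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=3)
print(merges)
```

On this toy corpus the learned merges build up "lo", then "low", so frequent stems become single tokens while rare suffixes ("er", "est") stay split, which is the behavior that makes subword tokenizers handle rare and unseen words gracefully.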