Vector Embeddings
Intro
- Recommendation systems, NLP, computer vision, generative AI, LLMs, etc. are
all based on vector embeddings.
- Items are mapped into an embedding space.
- Making recommendations based on distance between vectors, e.g. cosine
similarity (cosine distance = 1 − similarity):

  cos(A, B) = (A ⋅ B) / (‖A‖ ‖B‖) = Σᵢ₌₁ⁿ AᵢBᵢ / (√(Σᵢ₌₁ⁿ Aᵢ²) √(Σᵢ₌₁ⁿ Bᵢ²))
- Examples of embedding dimensions: male-female, verb tense, country-capital
- Latent Space, aka Embedding Space
- Word Embedding Models
- Word2Vec
- GloVe
- BERT
- GPT
- VGGNet (image)
- GoogLeNet (image)
- Neural networks are used to obtain the embeddings
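The cosine similarity above, plus nearest-neighbour lookup over item vectors, is the core of the recommendation idea. A minimal sketch with made-up 3-d item embeddings (toy numbers, not output from a real model):

```python
import math

def cosine_similarity(a, b):
    # (A . B) / (||A|| ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical item embeddings for illustration only
items = {
    "action_movie": [0.9, 0.1, 0.0],
    "war_film":     [0.8, 0.2, 0.1],
    "romcom":       [0.1, 0.9, 0.3],
}

# Recommend the item whose vector is closest (by cosine) to the query item
query = items["action_movie"]
ranked = sorted(
    (name for name in items if name != "action_movie"),
    key=lambda name: cosine_similarity(query, items[name]),
    reverse=True,
)
print(ranked[0])  # the most similar other item
```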
Models
- Traditional methods
- One-hot encoding, high dimensionality
- Bag-of-Words (BoW), also high dimensionality
- TF-IDF
- N-grams
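The traditional methods above can be sketched in a few lines. Bag-of-Words counts each vocabulary word (hence the high dimensionality: one dimension per word), and TF-IDF down-weights words that occur in many documents; toy corpus, plain Python:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Vocabulary: one dimension per distinct word -> sparse, high-dimensional vectors
vocab = sorted({w for d in docs for w in d.split()})

def bow(doc):
    # Bag-of-Words: raw counts per vocabulary word
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

# Document frequency: in how many documents does each word appear?
N = len(docs)
df = {w: sum(1 for d in docs if w in d.split()) for w in vocab}

def tfidf(doc):
    # TF-IDF: term frequency scaled down for words common across documents
    counts = Counter(doc.split())
    return [counts[w] * math.log(N / df[w]) for w in vocab]

print(bow(docs[0]))
print(tfidf(docs[0]))
```

Note how "the" dominates the raw BoW counts but gets a small TF-IDF weight because it appears in most documents.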
- Statistical models
- LSA/LSI (Latent Semantic Analysis/Indexing), a pioneering statistical
approach
- pLSA
- LDA (Latent Dirichlet Allocation), a topic-modelling tool!
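LSA boils down to a truncated SVD of the term-document matrix: keep the top-k singular vectors and each document becomes a k-dimensional vector in the latent space. A sketch with a toy count matrix (made-up numbers), assuming NumPy is available:

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents
# (terms: cat, dog, pet, stock, market; docs 0-2 about pets, doc 3 about finance)
X = np.array([
    [2, 0, 1, 0],
    [0, 2, 1, 0],
    [1, 1, 2, 0],
    [0, 0, 0, 3],
    [0, 0, 0, 2],
], dtype=float)

# Truncated SVD: keep only the top-k singular directions (the latent space)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_embeddings = (np.diag(s[:k]) @ Vt[:k]).T  # one k-d vector per document

print(doc_embeddings.shape)  # (4, 2)
```

In the latent space the pet documents end up close together while the finance document points in a different direction, even though they share no exact word counts.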
- Word Embeddings (may need pooling, either by averaging or TF-IDF weighted
averaging)
- Word2Vec
- GloVe
- FastText (extends Word2Vec, with support for subwords)
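The pooling mentioned above turns per-word vectors into one document vector, either by plain averaging or by TF-IDF-weighted averaging. A sketch with toy 3-d vectors standing in for real Word2Vec/GloVe output (the weights are hypothetical TF-IDF scores):

```python
# Toy "word vectors"; a real model would supply 100-300-d vectors
word_vecs = {
    "the":   [0.1, 0.1, 0.1],
    "good":  [0.8, 0.1, 0.0],
    "movie": [0.2, 0.7, 0.1],
}

def mean_pool(tokens):
    # Plain averaging of the word vectors present in the vocabulary
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def weighted_pool(tokens, weights):
    # TF-IDF-weighted averaging: content words count more than stopwords
    pairs = [(word_vecs[t], weights[t]) for t in tokens if t in word_vecs]
    total = sum(w for _, w in pairs)
    return [sum(v[i] * w for v, w in pairs) / total for i in range(3)]

doc = ["the", "good", "movie"]
print(mean_pool(doc))
print(weighted_pool(doc, {"the": 0.1, "good": 1.0, "movie": 0.8}))
```

Down-weighting "the" pulls the document vector toward the content words, which usually gives a more useful representation than the plain mean.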
- Extensions of word embeddings
- Doc2Vec (based on Word2Vec)
- Transformers
- MiniLM (inspired by BERT)
- USE (Universal Sentence Encoder)
- BERT (or DistilBERT), not ideal for sentence embeddings out of the box
- RoBERTa
- OpenAI embedding
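One reason plain BERT is not ideal here: it emits one vector per token, so sentence-embedding models typically pool the token vectors, using the attention mask to skip padding positions. A minimal masked mean-pooling sketch with made-up token vectors (not real encoder output):

```python
# Made-up per-token vectors, as a BERT-style encoder would emit
# (one sentence, 4 token slots; the last slot is padding)
token_vecs = [
    [1.0, 2.0],
    [3.0, 0.0],
    [2.0, 4.0],
    [9.0, 9.0],  # padding slot: must be excluded from the pool
]
attention_mask = [1, 1, 1, 0]

def masked_mean_pool(vecs, mask):
    # Average only the positions the attention mask marks as real tokens
    kept = [v for v, m in zip(vecs, mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]

sentence_vec = masked_mean_pool(token_vecs, attention_mask)
print(sentence_vec)  # [2.0, 2.0]
```

Libraries such as sentence-transformers bundle this pooling step with the encoder, which is what makes models like MiniLM convenient for embeddings.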