Vector Embeddings

Intro

  • Recommendation systems, NLP, computer vision, generative AI, LLMs, etc. are all built on vector embeddings.
  • Items are mapped into an embedding space.
  • Recommendations are made based on the distance between vectors, e.g. cosine distance.
  • Examples of directions an embedding space can capture: male-female, verb tense, country-capital
  • Latent space, a.k.a. embedding space
  • Embedding models (word and image)
    • Word2Vec
    • GloVe
    • BERT
    • GPT
    • VGGNet (image)
    • GoogLeNet (image)
  • Neural networks are trained to produce the embeddings.
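
The cosine-distance idea above can be sketched in a few lines. The 3-dimensional "embeddings" below are made-up toy vectors for illustration; real models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|).
    # 1.0 means same direction; values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors (invented for this example, not from any real model).
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
apple = [0.1, 0.2, 0.9]

# "king" should be closer to "queen" than to "apple".
print(cosine_similarity(king, queen))
print(cosine_similarity(king, apple))
```

Cosine distance is then simply `1 - cosine_similarity(a, b)`; a recommender ranks candidate items by this distance to the query vector.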

Models

  • Traditional methods
    • One-hot encoding (sparse, very high-dimensional)
    • Bag-of-Words (BoW), also high-dimensional
    • TF-IDF
    • N-grams
  • Statistical models
    • LSA/LSI, a pioneering approach (SVD on the term-document matrix)
    • pLSA
    • LDA (Latent Dirichlet Allocation), a topic-modelling tool
  • Word embeddings (may need pooling into sentence/document vectors, by plain or TF-IDF-weighted averaging)
    • Word2Vec
    • GloVe
    • FastText (extends Word2Vec, with support for subwords)
  • Extensions of word embeddings
    • Doc2Vec (based on Word2Vec)
  • Transformers
    • MiniLM (inspired by BERT)
    • USE (Universal Sentence Encoder)
    • BERT (or DistilBERT), not ideal for sentence embeddings without further fine-tuning
    • RoBERTa
    • OpenAI embedding models (e.g. text-embedding-3-small, via API)
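
The pooling mentioned above for word embeddings can be sketched as mean pooling: average the word vectors of a sentence to get one fixed-size sentence vector. The 2-dimensional word vectors below are made up for illustration; real ones would come from a trained model such as Word2Vec or GloVe.

```python
# Toy word-vector table (invented values, 2 dimensions for readability).
word_vectors = {
    "the": [0.1, 0.1],
    "cat": [0.9, 0.2],
    "sat": [0.3, 0.8],
}

def sentence_embedding(tokens, vectors):
    # Mean pooling: element-wise average of the known word vectors.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no known words, no embedding
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

print(sentence_embedding(["the", "cat", "sat"], word_vectors))
```

TF-IDF-weighted averaging replaces the plain mean with a weighted mean, so rare, informative words contribute more to the sentence vector than common ones like "the".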