Transformer Architecture

  • A “mini-brain” sitting on each token.
  • A mini-brain can pass information to its right.
  • A mini-brain has state, i.e. to compute a given layer, all it needs are its own lower layers and the outputs of the mini-brains to its left.
  • Mini-brains can ask questions and share information.
  • “Backward and downward” attention: each position can only look at earlier positions and lower layers, so information only flows from left to right and from lower layers upward (see the sketch after this list).
  • The only way to get around “downward”: a newly generated token re-enters the stack from the bottom, so insights computed in high layers can be emitted as tokens and then read by every layer of future positions. This is the basis of chain-of-thought prompting.
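
A minimal sketch of that flow, assuming PyTorch and toy tensor sizes (both are illustrative, not from the notes): masking out positions to the right is exactly what makes information flow only from left to right, and each layer only ever reads the layer below it.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """x: (seq_len, d_model) outputs of the layer below; returns the same shape."""
    seq_len, d_model = x.shape
    # For illustration the inputs themselves act as queries, keys and values.
    scores = (x @ x.T) / d_model ** 0.5                 # (seq_len, seq_len)
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))    # no looking to the right
    weights = F.softmax(scores, dim=-1)                 # row t mixes only positions 0..t
    return weights @ x

x = torch.randn(5, 8)                  # 5 tokens, toy hidden size 8
out = causal_self_attention(x)
x2 = x.clone()
x2[4] += 1.0                           # perturb the last token only
# Earlier positions are unaffected: information never flows right to left.
assert torch.allclose(causal_self_attention(x2)[:4], out[:4])
```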

Embedding

  • Models can undergo “contrastive learning” or “Siamese training”, in which we train the model to produce similar embeddings for two sentences with similar meaning (see the sketch after this list).
  • Encoder-only: the BERT way, bidirectional attention, sees all tokens at once. Produces context-heavy representations.
  • Decoder-only: the GPT way, unidirectional attention; less context per token, but easier to scale and train (self-supervised next-word prediction).
  • Many embedding models are LLMs specially fine-tuned to produce a single vector representation.
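
A minimal sketch of the contrastive objective, assuming PyTorch with in-batch negatives; the encoder itself is left out, and the random tensors below are hypothetical stand-ins for pooled sentence embeddings from a shared (Siamese) encoder.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, temperature=0.05):
    """emb_a[i] and emb_b[i] embed two sentences with the same meaning."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = (a @ b.T) / temperature    # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0))   # the matching pair sits on the diagonal
    # Pull each pair together, push it away from every other sentence in the batch.
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for encoder outputs.
emb_a = torch.randn(16, 384, requires_grad=True)
emb_b = torch.randn(16, 384, requires_grad=True)
loss = contrastive_loss(emb_a, emb_b)
loss.backward()    # in real training this updates the shared encoder
```

With in-batch negatives, every other sentence in the batch acts as a “different meaning” example, so paraphrases end up close in embedding space and unrelated sentences end up far apart.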