Transformer Architecture

  • A “mini-brain” sitting on each token.
  • A mini-brain can pass information to its right.
  • A mini-brain has state, i.e. to compute a given layer, all it needs are its own lower layers and the outputs of the mini-brains to its left.
  • Mini-brains can ask questions and share information.
  • “Backward and downward” attention: each position can only look at earlier positions and lower layers, so information only flows from left to right and from lower layers upward (see the sketch after this list).
  • The only way to get around “downward”: a newly generated token re-enters the stack from the bottom, so insights computed in high layers can be emitted as tokens and then read by every layer of future positions. This is the basis of chain-of-thought prompting.
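
A minimal sketch of that flow, assuming PyTorch and toy tensor sizes (both are illustrative, not from the notes): masking out positions to the right is exactly what makes information flow only from left to right, and each layer only ever reads the layer below it.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """x: (seq_len, d_model) outputs of the layer below; returns the same shape."""
    seq_len, d_model = x.shape
    # For illustration the inputs themselves act as queries, keys and values.
    scores = (x @ x.T) / d_model ** 0.5                 # (seq_len, seq_len)
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))    # no looking to the right
    weights = F.softmax(scores, dim=-1)                 # row t mixes only positions 0..t
    return weights @ x

x = torch.randn(5, 8)                  # 5 tokens, toy hidden size 8
out = causal_self_attention(x)
x2 = x.clone()
x2[4] += 1.0                           # perturb the last token only
# Earlier positions are unaffected: information never flows right to left.
assert torch.allclose(causal_self_attention(x2)[:4], out[:4])
```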

Embedding

  • Models can undergo “contrastive learning” or “Siamese training”, in which we train the model to produce similar embeddings for two sentences with similar meaning (see the sketch after this list).
  • Encoder-only: the BERT way, bidirectional attention, sees all tokens at once. Produces context-heavy representations.
  • Decoder-only: the GPT way, unidirectional attention; less context per token, but easier to scale and train (self-supervised next-word prediction).
  • Many embedding models are LLMs specially fine-tuned to produce a single vector representation.
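
A minimal sketch of the contrastive objective, assuming PyTorch with in-batch negatives; the encoder itself is left out, and the random tensors below are hypothetical stand-ins for pooled sentence embeddings from a shared (Siamese) encoder.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, temperature=0.05):
    """emb_a[i] and emb_b[i] embed two sentences with the same meaning."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = (a @ b.T) / temperature    # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0))   # the matching pair sits on the diagonal
    # Pull each pair together, push it away from every other sentence in the batch.
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for encoder outputs.
emb_a = torch.randn(16, 384, requires_grad=True)
emb_b = torch.randn(16, 384, requires_grad=True)
loss = contrastive_loss(emb_a, emb_b)
loss.backward()    # in real training this updates the shared encoder
```

With in-batch negatives, every other sentence in the batch acts as a “different meaning” example, so paraphrases end up close in embedding space and unrelated sentences end up far apart.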