A mini-brain has state, i.e. to compute a layer, all it needs is its own
previous layers and the outputs of the mini-brains to its left.
Mini-brains can ask questions and share information.
There is no “backward” or “downward” mechanism: information only flows from left to right, and from lower layers to higher ones.
The only way to get around the “downward” restriction is that a newly generated
token, fed back in as input, has a chance to pass insights from high layers on
to future generated tokens. This is the basis of chain-of-thought prompting.
Embedding
Models can undergo “contrastive learning” or “Siamese training”, in which we
train the model to produce nearly identical embeddings for two sentences with
similar meaning (and, typically, dissimilar embeddings for unrelated sentences).
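A toy sketch of the Siamese setup. The “encoder” here is a stand-in linear map and the loss is a simple margin-based contrastive loss, both assumptions for illustration, not a specific library's training recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))  # stand-in shared "encoder": one linear map

def embed(x):
    """Both sentences pass through the SAME weights (the Siamese part)."""
    v = x @ W
    return v / np.linalg.norm(v)

def contrastive_loss(a, b, similar, margin=0.5):
    """Similar pair: pull cosine similarity toward 1.
    Dissimilar pair: only penalize similarity above the margin."""
    sim = float(embed(a) @ embed(b))
    return (1.0 - sim) if similar else max(0.0, sim - margin)

x1 = rng.normal(size=16)
x2 = rng.normal(size=16)
print(contrastive_loss(x1, x1, similar=True))   # identical inputs: loss is ~0
print(contrastive_loss(x1, x2, similar=False))
```

Gradient descent on this loss over many labeled pairs is what pulls same-meaning sentences together in embedding space.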
Encoder-only: the BERT way, bidirectional attention, sees all tokens at
once. Produces context-heavy representations.
Decoder-only: the GPT way, unidirectional (causal) attention. Each token sees
only what came before it, but this is easier to scale and train
(self-supervised next-word prediction).
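The two attention patterns differ only in their masks, and the decoder-only training signal is free: every sentence already contains its own (input, next-word) pairs. A tiny sketch:

```python
import numpy as np

# Encoder-only (BERT): every token attends to every token.
# Decoder-only (GPT): token i attends only to positions j <= i.
n = 4
bert_mask = np.ones((n, n), dtype=int)
gpt_mask = np.tril(np.ones((n, n), dtype=int))

# Self-supervised next-word prediction needs no human labels:
tokens = ["the", "cat", "sat", "down"]
pairs = list(zip(tokens[:-1], tokens[1:]))
print(pairs)  # [('the', 'cat'), ('cat', 'sat'), ('sat', 'down')]
```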
Many embedding models are LLMs specially fine-tuned to produce a single
representation vector.
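Turning an LLM's per-token states into that single vector is usually a pooling step. A minimal mean-pooling sketch with made-up hidden states (the names `hidden_states` and `attention_mask` are illustrative, not a specific library's API):

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average the final-layer token vectors, ignoring padding,
    then L2-normalize so cosine similarity becomes a dot product."""
    mask = np.asarray(attention_mask)[:, None].astype(float)  # (tokens, 1)
    summed = (hidden_states * mask).sum(axis=0)
    pooled = summed / mask.sum()
    return pooled / np.linalg.norm(pooled)

rng = np.random.default_rng(0)
states = rng.normal(size=(6, 32))         # 6 tokens, 32-dim hidden states
mask = np.array([1, 1, 1, 1, 0, 0])       # last two positions are padding
vec = mean_pool(states, mask)
print(vec.shape, round(float(np.linalg.norm(vec)), 3))  # (32,) 1.0
```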