Autoregression: take the most likely next token, append it to the prompt, and
run the model again to get the next token.
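The loop above can be sketched as follows. The bigram lookup table here is a toy stand-in for the model; a real LLM would score the entire context with a neural network.

```python
# Toy "model": a hardcoded most-likely-next-token table (hypothetical,
# standing in for a neural network's next-token prediction).
BIGRAMS = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
}

def generate(prompt, max_tokens=3):
    tokens = prompt.split()
    for _ in range(max_tokens):
        nxt = BIGRAMS.get(tokens[-1])  # most likely next token
        if nxt is None:
            break
        tokens.append(nxt)  # append it and run the "model" again
    return " ".join(tokens)
```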
Sampling tokens from possible outputs
Temperature: 0 gives near-deterministic (greedy) output; 1 samples tokens in
proportion to the model's probability distribution. Model output deteriorates
at high temperature: improbable tokens get sampled, and once gibberish enters
the context the model continues its pattern.
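A minimal sketch of temperature sampling: divide the logits by the temperature, softmax, then sample. The logits here are made up for illustration.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Pick a token index from raw logits.

    temperature -> 0 approaches greedy argmax; temperature = 1 samples
    directly from the model's distribution; higher values flatten it,
    boosting improbable tokens.
    """
    if temperature == 0:  # greedy: highest-scoring token wins
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):  # inverse-CDF sampling
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```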
Another way to look at LLMs: not just auto-completers, but highly effective
neural-network-powered classifiers at each token. LLMs also perform better
when used this way (e.g. tool calling in agents).
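The classifier view can be sketched by restricting the next-token choice to a fixed label set and taking the argmax, as in tool selection. The scores and tool names below are hypothetical.

```python
# Use the next-token distribution as a classifier: instead of free-form
# generation, compare the model's scores for a fixed set of candidate
# labels (here, tool names) and pick the best one.
def classify(scores, allowed):
    # restrict the "vocabulary" to the allowed labels only
    return max(allowed, key=lambda label: scores.get(label, float("-inf")))

# Hypothetical logits for the first generated token.
scores = {"search": 2.1, "calculator": 3.7, "reply": 1.4}
tool = classify(scores, ["search", "calculator"])
```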
seq2seq architecture: encoder + decoder + thought vector, recurrent design.
Challenge: the thought vector is fixed and finite.
[@bahdanauNeuralMachineTranslation2016] introduced preserving all of the
encoder's hidden state vectors so the decoder can “soft search” over them.
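The soft search above can be sketched as dot-product attention: score every encoder hidden state against the decoder's current query, softmax the scores into weights, and return the weighted sum as the context vector. Vectors are plain lists for illustration.

```python
import math

# Toy "soft search": instead of one fixed thought vector, keep every
# encoder hidden state and blend them, weighted by how well each state
# matches the decoder's current query.
def attend(query, hidden_states):
    # dot-product score of the query against each encoder state
    scores = [sum(q * h for q, h in zip(query, hs)) for hs in hidden_states]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over the states
    # context vector: weighted sum of all encoder hidden states
    return [sum(w * hs[d] for w, hs in zip(weights, hidden_states))
            for d in range(len(query))]
```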
[@vaswaniAttentionAllYou2017] Attention is All You Need introduced the
transformer architecture, removing the recurrent circuitry.
[@radfordImprovingLanguageUnderstanding2018] proposed the generative
pre-trained transformer (GPT) architecture: basically a transformer with the
encoder ripped off (decoder-only). Pre-training on unlabelled text followed
by fine-tuning for specific tasks worked pretty well.
GPT-2 increased the training set and model size, making it a multitask learner.
GPT-3 saw another order-of-magnitude increase in model size and training set.
[@brownLanguageModelsAre2020] - language models are few-shot learners; the
start of prompt engineering.