Large Language Model

Concepts

  • Prompt engineering
  • Tokens, the units a model reads and writes; the full set of tokens is the model's vocabulary.
  • Autoregression: pick a likely next token, append it to the prompt, and run the model again to get the next one.
  • Sampling tokens from possible outputs
  • Temperature: 0 gives near-deterministic (greedy) output; 1 samples tokens according to the model's probability distribution unchanged. At high temperature the distribution flattens toward uniform, and output deteriorates: once an unlikely token is sampled, it is appended and becomes part of the pattern the model continues from.
  • Transformer architecture
  • Fine-tuning process
  • Agent
  • Another way to look at LLMs: not just auto-completers, but highly effective neural-network-powered classifiers over the vocabulary at each token. LLMs also perform better when used this way (e.g. tool calling in agents).
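The autoregression and temperature bullets above can be sketched as a toy sampling loop. This is a minimal illustration, not any real library's API: `model` is a hypothetical callable returning one logit per vocabulary entry, and `sample_next` shows how temperature rescales logits before sampling.

```python
import math
import random

def sample_next(logits, temperature=1.0):
    """Sample a token index from logits after temperature scaling.

    temperature -> 0 approaches greedy argmax; temperature = 1 samples
    from the model's distribution unchanged; higher values flatten the
    distribution toward uniform.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

def generate(model, prompt_tokens, n_tokens, temperature=1.0):
    """Autoregressive loop: sample a next token, append it, run again."""
    tokens = list(prompt_tokens)
    for _ in range(n_tokens):
        logits = model(tokens)  # hypothetical model: one logit per vocab entry
        tokens.append(sample_next(logits, temperature))
    return tokens
```

At temperature 0 the loop is deterministic, which is why low temperature is used for tasks where the model acts as a classifier (e.g. picking a tool to call) rather than a creative generator.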

History

  • The Markov model of natural language was introduced by Shannon in 1948 in A Mathematical Theory of Communication, which also introduced the concept of information entropy.
  • seq2seq architecture: encoder + decoder + thought vector, recurrent design. Challenge: the thought vector is fixed and finite.
  • [@bahdanauNeuralMachineTranslation2016] introduced keeping all of the encoder's hidden state vectors so the decoder can “soft search” over them.
  • [@vaswaniAttentionAllYou2017] Attention is All You Need introduced transformer architecture, removed recurrent circuitry.
  • [@radfordImprovingLanguageUnderstanding2018] proposed the generative pre-trained transformer (GPT) architecture, essentially a transformer with the encoder removed. Pre-training on unlabelled text followed by fine-tuning for specific tasks worked remarkably well.
  • GPT-2 increased the training set and model size, making it a multitask learner.
  • GPT-3 saw another order-of-magnitude increase in model size and training set. [@brownLanguageModelsAre2020], Language Models are Few-Shot Learners, marks the start of prompt engineering.

Resources