Fine-Tuning Process
- Model alignment - fine-tuning the base model so that it behaves as the user intends.
- HHH - Helpful, honest, harmless fine-tuning.
- Supervised fine-tuning (SFT) model - fine-tuned on top of the base model using sample conversations between a person and an HHH assistant.
- SFT models generate completions, and human experts score/rank them (all drawing only on the model's internal knowledge).
- Reward model - trained with reinforcement-learning techniques: take the SFT model and train it on the rankings of SFT-generated completions so that it returns a numerical value representing the reward. Only rankings are studied, so the LLM learns to stay consistent with its own internal knowledge.
- RLHF model - starting from the SFT model, generate completions and have the reward model evaluate them. Proximal policy optimization (PPO) is needed so the RLHF model can't produce text significantly different from the SFT model (cheating) just to get a high score.
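The reward model described above is commonly trained with a pairwise ranking loss over human-ranked completion pairs (a Bradley-Terry style objective); the source does not name the exact loss, so this is a minimal dependency-free sketch of that idea with toy scores, not a real model:

```python
import math

def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Push the reward of the human-preferred completion above the
    rejected one: loss = -log(sigmoid(r_chosen - r_rejected))."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# Toy scores the reward model assigns to two completions of one prompt.
loss_good = pairwise_ranking_loss(2.0, -1.0)   # ranking already correct -> small loss
loss_bad = pairwise_ranking_loss(-1.0, 2.0)    # ranking violated -> large loss
print(loss_good < loss_bad)
```

Note the loss only depends on the *difference* of the two rewards, which matches the note above: only the ranking matters, not absolute scores.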
ChatML trains chat models (instead of instruct models); each message is wrapped in `<|im_start|>` and `<|im_end|>`, with roles: `system`, `user`, `assistant`, `function`.
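A ChatML prompt can be assembled as plain strings; a small sketch using the delimiters and roles from the note above (the message contents are made up):

```python
def chatml_turn(role: str, content: str) -> str:
    """Wrap one message in ChatML delimiters."""
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"

prompt = (
    chatml_turn("system", "You are a helpful assistant.")
    + chatml_turn("user", "What is ChatML?")
    + "<|im_start|>assistant\n"   # leave the assistant turn open for generation
)
print(prompt)
```

Leaving the final `assistant` turn unclosed is what cues the model to generate; the model itself emits `<|im_end|>` when the reply is done.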
Techniques
- Full fine-tuning (continued pre-training) - simply continues training on new documents; all parameters are updated, so it is computationally intensive.
- LoRA (low-rank adaptation) - trains a "diff" on key parameter matrices. Suitable for teaching the model a new distribution: how to interpret the prompt, what completions are expected, etc.
- With continued pre-training and LoRA, most of the static part of the prompt can be eliminated, including the few-shot examples.
- Soft prompting - combining prompting and machine learning to find the "state of mind" most likely to produce the desired outcome; also considered a type of fine-tuning.
- PEFT (parameter-efficient fine-tuning) - umbrella term for techniques, such as LoRA and soft prompting, that update only a small fraction of the model's parameters.
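The LoRA "diff" above works by learning two low-rank factors B (d x r) and A (r x d) and adding their product to the frozen weight matrix W; only B and A are trained. A dependency-free sketch with toy sizes (the dimensions, rank, and values are illustrative, not from any real model):

```python
def matmul(X, Y):
    """Plain-list matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, r = 4, 1                                  # toy model dim and LoRA rank (r << d)
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weight
B = [[0.1] for _ in range(d)]                # d x r, trainable
A = [[0.2] * d]                              # r x d, trainable
delta = matmul(B, A)                         # the low-rank "diff" B @ A

# Effective weight used at inference: W + B @ A
W_eff = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]

# Trainable parameters: 2*d*r instead of d*d
print(2 * d * r, "vs", d * d)
```

Even in this toy case the trainable count drops from d*d = 16 to 2*d*r = 8; at real model scale (d in the thousands, r around 8-64) the savings dominate.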
After fine-tuning, make sure new prompts follow the "new way" (the format used during fine-tuning); otherwise the model will tend to "forget" the fine-tuning.
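The "state of mind" search in soft prompting happens in continuous embedding space: a few trainable virtual-token vectors are prepended to the real token embeddings, and only those vectors are optimized while the model stays frozen. A toy sketch (the embedding dimension and all values are made up for illustration):

```python
dim = 3

# Trainable virtual-token embeddings (the "soft prompt"); the frozen model
# treats them as ordinary token embeddings at the start of the input.
soft_prompt = [[0.01, -0.02, 0.03], [0.05, 0.00, -0.01]]

# Embeddings the frozen model's lookup table produces for the real tokens.
token_embeddings = [[0.4, 0.1, 0.0], [0.2, 0.3, 0.1]]

# Prepend the soft prompt, then run the frozen model on the combined sequence.
model_input = soft_prompt + token_embeddings
print(len(model_input))  # 2 virtual + 2 real tokens
```

Because the soft prompt lives in embedding space rather than vocabulary space, it can encode "instructions" that no sequence of discrete tokens expresses exactly.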