Prompt Engineering for LLM

  • Prompting steps
    • Context retrieval
    • Snippetizing context
    • Scoring and prioritizing snippets (some may need to be dropped)
    • Prompt assembly
  • Tools like DSPy can be used to optimize prompt construction.
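
A minimal sketch of the scoring/prioritizing/assembly steps, assuming a simple character budget and precomputed relevance scores (a real system would count tokens and retrieve/score with a model):

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    text: str
    score: float  # estimated value of this piece of context

def assemble_prompt(snippets: list[Snippet], budget: int) -> str:
    """Prioritize snippets by score, dropping those that no longer fit
    the budget, then join the survivors into a prompt."""
    chosen, used = [], 0
    for s in sorted(snippets, key=lambda s: s.score, reverse=True):
        if used + len(s.text) <= budget:
            chosen.append(s)
            used += len(s.text)  # lower-value snippets may get dropped
    return "\n\n".join(s.text for s in chosen)
```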

About the Prompt

  • Truth bias: the model tends to take prompt content at face value, so inaccuracies in the prompt carry over into the completion.
  • LLMs are all about completing a document.
  • Putting user content inside the system message gives users a chance to override the system instructions (prompt injection).
  • Criteria for prompt (for completion models)
    • Should be similar to texts that the LLM is trained on
    • Should contain all information needed to complete
    • Should lead to a solution
    • Should have a clear stop
  • Dos and don’ts
    • Prefer dos over don’ts
    • Give a reason for each instruction (thou shalt not kill because…)
  • Few-shot prompting
    • Usually much easier than instruction-based prompting, since LLMs are good at following examples; but it has limits.
    • Does not scale well when the context is large (long examples, or too many of them).
    • Can anchor the model in unexpected ways, especially biasing it toward edge cases (the model may assume they are as common as typical cases).
    • Can suggest spurious patterns, such as an accidental sorting order; you never know what pattern the LLM will extrapolate.
    • Try to make the model “believe” that it has already solved a few similar problems successfully.
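
One way to apply the last point, sketched with made-up Q/A pairs: lay out prior examples as if the model had already solved them, and leave the final answer open:

```python
def few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Present prior Q/A pairs as problems the model has already solved
    successfully, then pose the real question.  Consider shuffling the
    examples so their order does not suggest a spurious pattern."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```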

Context

  • Latency matters. It’s best to gather as many candidate context items as we can and then whittle them down, so context items should be comparable in terms of their value.
  • Brainstorm with mind map to find potential context items.
  • Two dimensions for context items: proximity to the user, and stability.
  • Irrelevant information should be avoided (the Chekhov’s Gun fallacy): the LLM will reason hard to make sense of every piece of information it is given. Use RAG to retrieve only relevant context.
  • Summarization is needed when context is too long
    • Summarize summaries if content exceeds context window.
    • Recursive summaries: summarize at sections level, then at chapter level, then book level.
    • “Rumor problem”: the model can misunderstand things during summarization, and such errors compound across levels.
    • Summarization is lossy; ask for the summary with the final application task in mind. Task-specific summaries are better but can’t be shared among different use cases.
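
The recursion can be sketched as follows; `llm` is a placeholder callable assumed to shorten its input, and a character `limit` stands in for the context window:

```python
def summarize(text: str, task: str, llm) -> str:
    """Ask for a summary written with the final task in mind,
    since summarization is lossy."""
    return llm(f"Summarize for the task '{task}':\n{text}")

def recursive_summary(chunks: list[str], task: str, llm, limit: int) -> str:
    """Summarize sections, then summaries of summaries (section ->
    chapter -> book level) until the result fits the window."""
    combined = "\n".join(summarize(c, task, llm) for c in chunks)
    if len(combined) <= limit:
        return combined
    return recursive_summary([combined], task, llm, limit)
```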

Assembly of Prompt

  • Constraints
    • In-context learning: the closer the information is to the end of the prompt, the more impact it has on the model.
    • The lost-in-the-middle phenomenon: the model easily recalls the beginning and end of the prompt but struggles with information in the middle.
  • Structure
    • Introduction: guiding the focus of the LLM from the very beginning.
    • Valley of Meh: content in this middle region has reduced impact.
    • Context
    • Refocus: necessary for longer prompts to bring the model’s attention back to the question itself. e.g. “Based on the given information, I am ready to answer the question regarding…”
    • Transition: e.g. “The answer is…” In some models, this is implied by a question mark.
  • Chat vs Completion model
    • Chat models benefit from natural multi-round interactive problem solving.
    • Completion models avoid some unhelpful traits introduced by RLHF, and allow inception, where we dictate the beginning of the answer.
  • Document types
    • Dialogues: freeform text or transcripts; marker-less or structured.
    • Analytic Report: preferably in Markdown format, with an ## Idea monologue section that can be ignored (chain-of-thought prompting), the ## Conclusion section is the actual output, and ## Further Reading can be treated as a marker for end of response.
    • Structured Document: XML, YAML, JSON, etc.
  • Elastic snippets: given a limited context window, create multiple versions of a context snippet and place the biggest version that fits into the final prompt.
  • Relationship between (sub) prompts
    • Position
    • Importance, assessed with scores or tiers.
    • Dependency, e.g. requirements and incompatibilities for snippets.
  • A prompt crafting engine: respects the constraints, uses some algorithm (e.g. an additive/subtractive greedy algorithm) to pick snippets, then reconstructs the prompt according to their positions.
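
A sketch of such an engine, assuming character budgets, numeric positions, and name-based dependencies (all illustrative choices), with elastic snippets included:

```python
from dataclasses import dataclass, field

@dataclass
class Snippet:
    name: str
    text: str
    score: float   # importance (score or tier)
    position: int  # desired slot in the assembled prompt
    requires: set = field(default_factory=set)  # names this snippet depends on

def pick_elastic(versions: list[str], remaining: int):
    """Elastic snippet: return the biggest version that still fits."""
    for v in sorted(versions, key=len, reverse=True):
        if len(v) <= remaining:
            return v
    return None

def craft_prompt(snippets: list[Snippet], budget: int) -> str:
    """Additive greedy: take snippets in descending score while they fit
    and their dependencies are already included, then lay them out by
    position.  (A snippet scored higher than its own dependency gets
    dropped here; a real engine would handle that case.)"""
    chosen, names, used = [], set(), 0
    for s in sorted(snippets, key=lambda s: s.score, reverse=True):
        if used + len(s.text) > budget or not s.requires <= names:
            continue
        chosen.append(s)
        names.add(s.name)
        used += len(s.text)
    return "\n".join(s.text for s in sorted(chosen, key=lambda s: s.position))
```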

Completion/Response

  • Preamble
    • Structural boilerplate: can be eliminated through prompting.
    • Reasoning: desirable with chain-of-thought prompting.
    • Fluff: should be avoided, e.g. by prescribing the format up front: “Please reply in the following format: 1. result 1, result 2, …, result n; 2. Disclaimers (if any); 3. Background and explanation (if any).”
  • Postscript: detect the end of the actual answer using stop sequences, then end the stream.
  • Recognizing Start and End
  • Logprob: the averaged logprob of the completion is an indicator of the response’s confidence or quality.
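
Given per-token logprobs (as returned by APIs that expose them), the average, and the equivalent perplexity view, are straightforward:

```python
import math

def avg_logprob(token_logprobs: list[float]) -> float:
    """Mean token logprob; values closer to 0 suggest higher confidence."""
    return sum(token_logprobs) / len(token_logprobs)

def perplexity(token_logprobs: list[float]) -> float:
    """Equivalent view: exp of the negative mean logprob (lower is better)."""
    return math.exp(-avg_logprob(token_logprobs))
```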

Techniques

  • Chain-of-thought (CoT)
  • ReAct
  • Reflexion: run another analysis pass when applying the output of the LLM. The critic can be traditional code or another LLM (LLM-as-judge).
  • Agentic usage, including tool calling and reasoning.
  • Frameworks such as DSPy and TextGrad can be used to improve the prompts given I/O examples.
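
A reflexion loop with a traditional (non-LLM) critic can be as simple as validating the output and feeding the error back; `llm` is a placeholder callable, and JSON validity is the stand-in check:

```python
import json

def reflexion_loop(llm, prompt: str, max_tries: int = 3):
    """Run the model, validate its output with traditional code
    (here: JSON parsing), and retry with the error as feedback.
    The critic could equally be another LLM (LLM-as-judge)."""
    feedback = ""
    for _ in range(max_tries):
        output = llm(prompt + feedback)
        try:
            return json.loads(output)
        except ValueError as err:
            feedback = f"\nThe previous answer was not valid JSON ({err}). Try again."
    raise ValueError(f"no valid output after {max_tries} attempts")
```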

LLM as Classifier

  • When the LLM is used as a classifier, it’s important to make sure the options all start with different tokens. Otherwise the model will favor options sharing a common prefix, as their logprobs add up.
  • Calibrate the model by shifting the logprob by a constant if needed, e.g. so it only answers No when it’s quite certain. The constant can be found by experimentation or by minimizing the cross-entropy loss, as in logistic regression.
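
A sketch of the calibration, using a softmax over the two options' first-token logprobs and a grid search for the shift (grid search stands in for the gradient step of logistic regression):

```python
import math

def choose_label(lp_yes: float, lp_no: float, bias: float = 0.0) -> str:
    """Answer 'no' only when its logprob beats the biased 'yes' logprob;
    a positive bias means 'no' must be quite certain to win."""
    return "yes" if lp_yes + bias >= lp_no else "no"

def cross_entropy(bias: float, examples) -> float:
    """examples: (lp_yes, lp_no, true_label) triples.  The probability
    of 'yes' is a softmax over the two shifted logprobs."""
    loss = 0.0
    for lp_yes, lp_no, label in examples:
        p_yes = 1.0 / (1.0 + math.exp(lp_no - (lp_yes + bias)))
        loss -= math.log(p_yes if label == "yes" else 1.0 - p_yes)
    return loss / len(examples)

def fit_bias(examples, grid=None) -> float:
    """Pick the shift minimizing cross-entropy on labeled examples."""
    grid = grid if grid is not None else [i / 10 for i in range(-50, 51)]
    return min(grid, key=lambda b: cross_entropy(b, examples))
```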