What six LLM practitioners learned building real AI products
Executive overview
Most teams over-invest in fine-tuning and model-switching when prompt engineering, RAG, and evaluations already cover 80–90% of what they need. The blog series distilled here — written by six active LLM builders and published on O'Reilly — organises hard-won lessons into three layers: tactics (prompting and RAG), operations (model management), and strategy (building durable moats).
The core insight: master prompt engineering and RAG before touching anything else — fine-tuning is a very long last 10%, not a quick win.
Fine-tuning: when and whether to bother
- Fine-tuning addresses only the last ~10% of quality — and that 10% is disproportionately hard to close.
- Before pursuing it, exhaust alternatives: quality prompt engineering, a well-structured RAG pipeline, and iterative evals.
- Ask three questions: Have I tried all alternatives? Does my use case require near-perfect consistency? Is the ROI large enough to justify the effort?
- If you proceed, use synthetic data (model-generated) or open-source datasets to bootstrap training — human-labelled data is expensive and slow.
Prompt engineering tactics
- Few-shot examples: include 5–24 examples in your system prompt. Below five risks over-fitting to the examples; above 24 risks overwhelming the model. Focus on ideal outputs, not input-output pairs.
- Chain of thought: at minimum, append "think step by step". Better: provide a scratchpad via XML tags so the model reasons before answering. Best: specify exactly what to reason about in each scratchpad section (extract → verify → synthesise).
- Delimiters: use structured boundaries (XML for Claude, JSON/Markdown for GPT) to separate distinct data sets passed to the model. Use the Python libraries instructor (commercial models) or outlines (open-source models) to enforce structured output in multi-LLM pipelines.
- Prompt chaining: break large tasks into small, single-objective steps, each handled by a separate LLM call. Example: one call extracts decisions and owners, a second verifies consistency, a third writes the summary. Each step feeds the next.
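The scratchpad, delimiter, and chaining tactics above can be combined in one small sketch. It is illustrative only: `llm` is a hypothetical callable that takes a prompt string and returns the reply text, and the tag names, task wording, and helper names (`wrap`, `with_scratchpad`, `parse_answer`, `summarise_meeting`) are assumptions rather than anything prescribed by the series.

```python
import re
from typing import Callable

# Hypothetical stand-in for a model client: takes a prompt, returns the reply text.
LLM = Callable[[str], str]


def wrap(tag: str, text: str) -> str:
    """Delimit one block of input with XML-style tags."""
    return f"<{tag}>\n{text}\n</{tag}>"


def with_scratchpad(task: str, *blocks: str) -> str:
    """Build a prompt that asks for reasoning first, then a tagged answer."""
    return "\n\n".join([
        task,
        "Reason inside <scratchpad> tags, then give the result inside <answer> tags.",
        *blocks,
    ])


def parse_answer(reply: str) -> str:
    """Return the text inside <answer> tags, or the whole reply if none are found."""
    match = re.search(r"<answer>(.*?)</answer>", reply, re.DOTALL)
    return (match.group(1) if match else reply).strip()


def summarise_meeting(llm: LLM, transcript: str) -> str:
    """Three single-objective calls: extract, verify, then summarise."""
    source = wrap("transcript", transcript)
    decisions = parse_answer(llm(with_scratchpad(
        "List each decision in the transcript and who owns it.", source)))
    verified = parse_answer(llm(with_scratchpad(
        "Check each listed decision against the transcript and drop any it does not support.",
        source, wrap("decisions", decisions))))
    return parse_answer(llm(with_scratchpad(
        "Write a short summary of the verified decisions.",
        wrap("decisions", verified))))
```

Each call sees only the blocks it needs, and the scratchpad text is discarded before the next step, so reasoning from one step does not leak into the next prompt.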
RAG and grounding
- Grounding: instruct the model explicitly to consult the vector database first (or only) before answering. If no relevant data exists, it should say so rather than hallucinate. Best-in-class providers report a ~2% irreducible hallucination floor.
- Hybrid search: combine semantic search (vector embeddings for synonyms, vague queries, misspellings) with keyword search (faster, cheaper, more precise for identifiers and proper names). The choice is "and", not "or".
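As a rough illustration of the "and, not or" point, the sketch below merges a semantic ranking and a keyword ranking with reciprocal rank fusion, then builds a grounded prompt from the result. Reciprocal rank fusion is one common way to combine rankings, not a method named in the series; the constant `k = 60`, the function names, and the prompt wording are all assumptions.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document ids; earlier ranks contribute more."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def grounded_prompt(question: str, passages: Sequence[str]) -> str:
    """Ask the model to answer only from retrieved passages, or admit it cannot."""
    if passages:
        context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    else:
        context = "(no passages retrieved)"
    return (
        "Answer using only the passages below. If they do not contain the answer, "
        "reply exactly: I don't know.\n\n"
        f"<passages>\n{context}\n</passages>\n\n"
        f"Question: {question}"
    )


# Document "b" ranks high in both lists, so it comes out on top.
semantic_hits = ["a", "b", "c"]   # e.g. from a vector index
keyword_hits = ["b", "d", "a"]    # e.g. from a keyword index
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))  # ['b', 'a', 'd', 'c']
```

Fusing on rank rather than raw score avoids having to normalise similarity scores from two different retrieval systems.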
Evaluations
- Manual human review is too costly at scale; use an LLM-as-judge pattern instead — run evals in a batch process after responses are delivered to users, then feed results back to improve the model.
- Key practices for reliable evals:
  - Compare, don't score: ask the evaluator to pick the better of two answers rather than assign a numeric score. Comparisons are measurably more accurate.
  - Swap positions: run each comparison twice with A/B order reversed to neutralise position bias.
  - Allow ties: if both answers are genuinely equal, the evaluator should say so.
  - Chain of thought for the evaluator: requiring the judge to explain its reasoning improves accuracy and lets smaller, cheaper models match large-model quality.
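A minimal sketch of the comparison practices, assuming a hypothetical `judge` callable that wraps the LLM call and returns "first", "second", or "tie". Treating a verdict that flips when the order is reversed as a tie is an assumption made here, not a rule from the series.

```python
from typing import Callable

# Hypothetical judge: given (question, first_answer, second_answer) it returns
# "first", "second", or "tie". In practice this would wrap an LLM call that is
# asked to explain its reasoning before giving the verdict.
Judge = Callable[[str, str, str], str]


def compare(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    forward = {"first": "A", "second": "B", "tie": "tie"}[
        judge(question, answer_a, answer_b)]
    # Same pair, order reversed, so "first" now refers to answer B.
    backward = {"first": "B", "second": "A", "tie": "tie"}[
        judge(question, answer_b, answer_a)]
    # Assumption: a verdict that flips with the order is treated as a tie.
    return forward if forward == backward else "tie"
```

A judge that always prefers whichever answer it sees first produces "A" in one order and "B" in the other, so its position bias surfaces as a tie instead of a false win.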
Moving prompts between models
- Migrating a prompt from one model to another (including version upgrades) can degrade output quality without warning.
- Short-term fix — pinning: lock to a specific model version so provider updates don't affect you.
- Long-term fix — shadowing: run the candidate model in a parallel test environment that receives a copy of every live request. Monitor both outputs over time; only switch when confidence is established.
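Shadowing can be sketched roughly as follows. `live` and `candidate` are hypothetical callables standing in for two pinned model clients, and exact-match agreement is only a crude placeholder for the evals a real comparison would use. In production the candidate call would normally run off the request path so it cannot add latency; it is inline here to keep the sketch short.

```python
from typing import Callable, Dict, List, Optional

Model = Callable[[str], str]  # hypothetical stand-in for a pinned model client
ShadowLog = List[Dict[str, Optional[str]]]


def handle_request(prompt: str, live: Model, candidate: Model,
                   shadow_log: ShadowLog) -> str:
    """Serve the live model's reply and record the candidate's reply for later comparison."""
    live_reply = live(prompt)
    try:
        candidate_reply: Optional[str] = candidate(prompt)
        error: Optional[str] = None
    except Exception as exc:  # a failing candidate must never affect the user
        candidate_reply, error = None, repr(exc)
    shadow_log.append({"prompt": prompt, "live": live_reply,
                       "candidate": candidate_reply, "error": error})
    return live_reply


def agreement_rate(shadow_log: ShadowLog) -> float:
    """Share of logged requests where both models gave the same reply."""
    if not shadow_log:
        return 0.0
    same = sum(1 for row in shadow_log if row["candidate"] == row["live"])
    return same / len(shadow_log)
```

The user only ever receives `live_reply`, so the candidate can be observed on real traffic without any risk to the product.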
Collecting user feedback
- Feedback drives incremental model improvement; implicit signals are more abundant than explicit ones.
- Implicit examples: accepting an autocomplete unchanged (strong positive), accepting then editing (mild positive), rejecting outright (negative); choosing one of two image variations in Midjourney; selecting one of two ChatGPT response options.
- Explicit examples: thumbs up / thumbs down buttons. Useful but engagement is low — implicit signals should be the primary data source.
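One way to turn the implicit signals above into a number is to weight each event and average. Only the ordering (strong positive, mild positive, negative) comes from the text; the event names and weight values below are illustrative assumptions.

```python
from typing import Dict, Iterable

# Assumed weights: only their ordering reflects the text above.
SIGNAL_WEIGHTS: Dict[str, float] = {
    "accepted_unchanged": 1.0,     # strong positive
    "accepted_then_edited": 0.5,   # mild positive
    "rejected": -1.0,              # negative
}


def feedback_score(events: Iterable[str]) -> float:
    """Average weight of the recognised events; 0.0 when there are none."""
    weights = [SIGNAL_WEIGHTS[e] for e in events if e in SIGNAL_WEIGHTS]
    return sum(weights) / len(weights) if weights else 0.0
```

A score like this could be tracked per prompt version or per model to see whether a change helped, without asking users to click anything.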
Downsizing and product-market fit
- Start experiments with large commercial models to prove feasibility quickly.
- Once product-market fit is confirmed, downsize to smaller commercial models or self-hosted open-source alternatives to reduce latency, cut costs, and improve data privacy.
- Validate PMF before optimising infrastructure — premature downsizing wastes effort on features users may not want.
Building a strategic moat
Model-swapping is not a moat — the infrastructure built around models is. Four durable moat components:
- Guardrails — infrastructure that prevents harmful, off-policy, or unsafe outputs.
- Data flywheel — the feedback loop that continuously feeds user signals back into model improvement.
- Caching — storing responses to common queries so repeat questions bypass the LLM entirely, saving time and money.
- Evaluations — a robust, ongoing eval pipeline that compounds quality improvements over time.
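The caching component can be sketched as an exact-match cache in front of the model call. `llm` is a hypothetical callable, and the normalisation rule (lower-casing and collapsing whitespace) is an assumption about what counts as "the same question"; a production system might instead match on semantic similarity and would need an expiry policy.

```python
import hashlib
from typing import Callable, Dict


class CachedLLM:
    """Exact-match response cache in front of a model call."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self._llm = llm  # hypothetical model client
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Assumption: queries differing only in case or whitespace are the same query.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def ask(self, prompt: str) -> str:
        """Return a stored reply when available; otherwise call the model and store it."""
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        reply = self._llm(prompt)
        self._store[key] = reply
        return reply
```

The `hits` and `misses` counters give a quick read on how much traffic actually bypasses the model.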