What six LLM practitioners learned building real AI products

Executive overview

Most teams over-invest in fine-tuning and model-switching when prompt engineering, RAG, and evaluations already cover 80–90% of what they need. The blog series distilled here — written by six active LLM builders and published on O'Reilly — organises hard-won lessons into three layers: tactics (prompting and RAG), operations (model management), and strategy (building durable moats).

The core insight: master prompt engineering and RAG before touching anything else — fine-tuning is a very long last 10%, not a quick win.

Fine-tuning: when and whether to bother

  • Fine-tuning addresses only the last ~10% of quality — and that 10% is disproportionately hard to close.
  • Before pursuing it, exhaust alternatives: quality prompt engineering, a well-structured RAG pipeline, and iterative evals.
  • Ask three questions: Have I tried all alternatives? Does my use case require near-perfect consistency? Is the ROI large enough to justify the effort?
  • If you proceed, use synthetic data (model-generated) or open-source datasets to bootstrap training — human-labelled data is expensive and slow.

Prompt engineering tactics

  • Few-shot examples: include 5–24 examples in your system prompt. Fewer than five risks the model over-anchoring on those specific examples; more than 24 risks overwhelming it. Examples of the ideal output are often enough; full input-output pairs are not always required.
  • Chain of thought: at minimum, append "think step by step". Better: provide a scratchpad via XML tags so the model reasons before answering. Best: specify exactly what to reason about in each scratchpad section (extract → verify → synthesise).
  • Delimiters: use structured boundaries (XML for Claude, JSON/Markdown for GPT) to separate the distinct pieces of data passed to the model. In multi-LLM pipelines, use the Python libraries instructor (commercial model APIs) or outlines (self-hosted open-source models) to enforce structured output that the next step can parse.
  • Prompt chaining: break large tasks into small, single-objective steps — each handled by a separate LLM call. Example: one call extracts decisions and owners, a second verifies consistency, a third writes the summary. Each step feeds the next.
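The chaining example above (extract, verify, summarise) can be sketched as three small calls joined by XML-style delimiters. This is a minimal illustration, not the authors' implementation: the `LLM` type, `wrap`, and `summarise_meeting` are hypothetical names, and `llm` stands in for whatever client function you actually use.

```python
from typing import Callable

# Hypothetical type: any function that takes a prompt string and returns text.
LLM = Callable[[str], str]


def wrap(tag: str, text: str) -> str:
    """Wrap text in an XML-style delimiter so each input is clearly bounded."""
    return f"<{tag}>\n{text}\n</{tag}>"


def summarise_meeting(transcript: str, llm: LLM) -> str:
    """Chain three single-objective calls: extract, verify, summarise."""
    # Step 1: extraction only.
    extracted = llm(
        "List every decision and its owner from the transcript.\n"
        + wrap("transcript", transcript)
    )
    # Step 2: verification, with a scratchpad so the model reasons first.
    verified = llm(
        "Check the extracted items against the transcript. "
        "Reason inside <scratchpad> tags first, then return the corrected list.\n"
        + wrap("transcript", transcript)
        + "\n"
        + wrap("extracted", extracted)
    )
    # Step 3: writing, using only the verified output of step 2.
    return llm(
        "Write a short summary using only the verified items.\n"
        + wrap("verified", verified)
    )
```

Because each step has one objective, a failure can be traced to a single prompt and fixed without touching the other two.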

RAG and grounding

  • Grounding: instruct the model explicitly to answer from the passages retrieved from the vector database first (or only). If nothing relevant is retrieved, it should say so rather than hallucinate. Even best-in-class providers report that hallucination is hard to push below a floor of roughly 2%.
  • Hybrid search: combine semantic search (vector embeddings for synonyms, vague queries, misspellings) with keyword search (faster, cheaper, more precise for identifiers and proper names). The choice is "and", not "or".
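One way to combine the two result lists is reciprocal rank fusion, followed by a prompt that restricts the model to the retrieved passages. The source does not name a fusion method, so treat RRF as an assumption chosen for illustration. The function names, document IDs, and the constant k = 60 are likewise assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) for every list it appears in.
    k = 60 is a commonly used default, not a value taken from the source.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def grounded_prompt(question: str, passages: list[str]) -> str:
    """Build a prompt that tells the model to answer only from the passages."""
    context = "\n".join(f"<passage>{p}</passage>" for p in passages)
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, reply 'I don't know'.\n"
        f"<context>\n{context}\n</context>\n"
        f"<question>{question}</question>"
    )


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["doc_sku_123", "doc_returns", "doc_shipping"]
semantic_hits = ["doc_returns", "doc_refunds", "doc_sku_123"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Documents found by both searches rise to the top, while results found by only one search are still kept further down the list.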

Evaluations

  • Manual human review is too costly at scale; use an LLM-as-judge pattern instead — run evals in a batch process after responses are delivered to users, then feed results back to improve the model.
  • Key practices for reliable evals:
    • Compare, don't score: ask the evaluator to pick the better of two answers rather than assign a numeric score — comparisons are measurably more accurate.
    • Swap positions: run each comparison twice with A/B order reversed to neutralise position bias.
    • Allow ties: if both answers are genuinely equal, the evaluator should say so.
    • Chain of thought for the evaluator: requiring the judge to explain its reasoning improves accuracy and lets smaller, cheaper models match large-model quality.
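The four practices above fit into one small comparison routine. The sketch below assumes a hypothetical `judge` callable that returns free-text reasoning ending in a final line of A, B, or TIE. Treating any disagreement between the two orderings as a tie is a design choice made here, not something the source prescribes.

```python
from typing import Callable

# Hypothetical judge: takes a prompt, returns reasoning plus a final verdict line.
Judge = Callable[[str], str]


def build_prompt(question: str, first: str, second: str) -> str:
    """Ask for a comparison (not a score), with reasoning before the verdict."""
    return (
        "Compare the two answers to the question. Explain your reasoning, "
        "then finish with a final line containing exactly A, B, or TIE.\n"
        f"<question>{question}</question>\n"
        f"<answer_a>{first}</answer_a>\n"
        f"<answer_b>{second}</answer_b>"
    )


def parse_verdict(reply: str) -> str:
    """Read the verdict from the last non-empty line; unparseable replies count as TIE."""
    lines = [ln.strip().upper() for ln in reply.splitlines() if ln.strip()]
    last = lines[-1] if lines else ""
    return last if last in {"A", "B", "TIE"} else "TIE"


def compare(question: str, x: str, y: str, judge: Judge) -> str:
    """Return "x", "y", or "tie" after judging in both presentation orders."""
    first = parse_verdict(judge(build_prompt(question, x, y)))   # x shown as A
    second = parse_verdict(judge(build_prompt(question, y, x)))  # x shown as B
    x_wins = (first == "A") + (second == "B")
    y_wins = (first == "B") + (second == "A")
    if x_wins == 2:
        return "x"
    if y_wins == 2:
        return "y"
    # Disagreement across orders suggests position bias, so call it a tie.
    return "tie"
```

A judge that always prefers whichever answer is shown first yields "tie" under this scheme, which is the effect the position swap is meant to have.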

Moving prompts between models

  • Migrating a prompt from one model to another (including version upgrades) can degrade output quality without warning.
  • Short-term fix — pinning: lock to a specific model version so provider updates don't affect you.
  • Long-term fix — shadowing: run the candidate model in a parallel test environment that receives a copy of every live request. Monitor both outputs over time; only switch when confidence is established.
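Shadowing can be sketched as a thin wrapper around the live call. This is a simplified, synchronous illustration: `serve_with_shadow` and the `Model` type are hypothetical names. A production setup would more likely send the shadow request asynchronously so it adds no latency for users.

```python
from typing import Callable

# Hypothetical type: a model is any function from request text to response text.
Model = Callable[[str], str]


def serve_with_shadow(
    request: str, primary: Model, candidate: Model, log: list[dict]
) -> str:
    """Answer from the pinned primary model and record the candidate's output."""
    live = primary(request)
    try:
        shadow = candidate(request)
    except Exception as exc:
        # A failing candidate must never affect what users receive.
        shadow = f"<error: {exc}>"
    log.append(
        {
            "request": request,
            "primary": live,
            "candidate": shadow,
            "match": live == shadow,
        }
    )
    # Users only ever see the primary model's response.
    return live
```

The accumulated log can then be reviewed offline, for example with a pairwise LLM judge, before deciding whether to switch.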

Collecting user feedback

  • Feedback drives incremental model improvement; implicit signals are more abundant than explicit ones.
  • Implicit examples: accepting an autocomplete unchanged (strong positive), accepting then editing (mild positive), rejecting outright (negative); choosing one of two image variations in Midjourney; selecting one of two ChatGPT response options.
  • Explicit examples: thumbs up / thumbs down buttons. Useful but engagement is low — implicit signals should be the primary data source.
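The autocomplete signals above map naturally onto labels that can be logged for later training or evaluation. The event schema and label names below are hypothetical, chosen only to mirror the three cases described.

```python
from dataclasses import dataclass


@dataclass
class SuggestionEvent:
    """One autocomplete suggestion and what the user did with it (hypothetical schema)."""

    suggested: str
    accepted: bool
    final_text: str = ""  # text left in the editor after any user edits


def implicit_label(event: SuggestionEvent) -> str:
    """Turn a user action into a feedback label without asking the user anything."""
    if not event.accepted:
        return "negative"
    if event.final_text == event.suggested:
        return "strong_positive"  # accepted unchanged
    return "mild_positive"  # accepted, then edited
```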

Downsizing and product-market fit

  • Start experiments with large commercial models to prove feasibility quickly.
  • Once product-market fit is confirmed, downsize to smaller commercial models or self-hosted open-source alternatives to reduce latency, cut costs, and improve data privacy.
  • Validate PMF before optimising infrastructure — premature downsizing wastes effort on features users may not want.

Building a strategic moat

Model-swapping is not a moat — the infrastructure built around models is. Four durable moat components:

  1. Guardrails — infrastructure that prevents harmful, off-policy, or unsafe outputs.
  2. Data flywheel — the feedback loop that continuously feeds user signals back into model improvement.
  3. Caching — storing responses to common queries so repeat questions bypass the LLM entirely, saving time and money.
  4. Evaluations — a robust, ongoing eval pipeline that compounds quality improvements over time.
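Of the four components, caching is the simplest to illustrate. The sketch below is an exact-match cache keyed on a normalised prompt. `CachedLLM` is a hypothetical wrapper, and the normalisation rule (lowercase, collapsed whitespace) is an assumption that may be too aggressive for case-sensitive prompts.

```python
import hashlib
from typing import Callable


class CachedLLM:
    """Exact-match response cache placed in front of a model call."""

    def __init__(self, llm: Callable[[str], str]):
        self._llm = llm
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalise case and whitespace so trivially different queries share an entry.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def __call__(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # repeat question: the model is not called
        self.misses += 1
        self._store[key] = self._llm(prompt)
        return self._store[key]
```

Tracking hits and misses shows how much traffic the cache actually absorbs, which is the number that justifies keeping it.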
