What six LLM practitioners learned building real AI products

Executive overview

Most teams over-invest in fine-tuning and model-switching when prompt engineering, RAG, and evaluations already cover 80–90% of what they need. The blog series distilled here — written by six active LLM builders and published on O'Reilly — organises hard-won lessons into three layers: tactics (prompting and RAG), operations (model management), and strategy (building durable moats).

The core insight: master prompt engineering and RAG before touching anything else — fine-tuning is a very long last 10%, not a quick win.

Fine-tuning: when and whether to bother

Fine-tuning addresses only the last ~10% of quality — and that 10% is disproportionately hard to close.
Before pursuing it, exhaust alternatives: quality prompt engineering, a well-structured RAG pipeline, and iterative evals.
Ask three questions: Have I tried all alternatives? Does my use case require near-perfect consistency? Is the ROI large enough to justify the effort?
If you proceed, use synthetic data (model-generated) or open-source datasets to bootstrap training — human-labelled data is expensive and slow.

Prompt engineering tactics

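Few-shot examples: include 5–24 examples in your system prompt. With fewer than five, the model tends to over-anchor on those specific examples; with more than 24, the examples risk overwhelming the rest of the prompt. Examples of ideal outputs are usually enough; full input-output pairs are often unnecessary.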
Chain of thought: at minimum, append "think step by step". Better: provide a scratchpad via XML tags so the model reasons before answering. Best: specify exactly what to reason about in each scratchpad section (extract → verify → synthesise).
Delimiters: use structured boundaries (XML for Claude, JSON or Markdown for GPT) to separate the distinct pieces of data passed to the model. To enforce structured output in multi-LLM pipelines, use the Python libraries instructor (for commercial model APIs) or outlines (for self-hosted open-source models).
Prompt chaining: break large tasks into small, single-objective steps — each handled by a separate LLM call. Example: one call extracts decisions and owners, a second verifies consistency, a third writes the summary. Each step feeds the next.
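
The three-step meeting example can be sketched in a few lines. This is a minimal illustration, not the series authors' code: the llm argument is a hypothetical stand-in for whatever client call you use, and the tag names (transcript, decisions, scratchpad, verified) and prompt wording are assumptions chosen for the example.

```python
import re
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns the model's text.
LLM = Callable[[str], str]


def tagged(tag: str, text: str) -> str:
    """Wrap text in an XML-style delimiter so the model can tell inputs apart."""
    return f"<{tag}>\n{text}\n</{tag}>"


def extract_tag(tag: str, text: str) -> str:
    """Return the contents of the first <tag>...</tag> block, or all text if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


def summarise_meeting(transcript: str, llm: LLM) -> str:
    # Step 1: a single objective, extract decisions and owners.
    decisions = llm(
        "List the decisions and their owners from the transcript.\n"
        + tagged("transcript", transcript)
    )
    # Step 2: verify consistency. The model reasons in a scratchpad first,
    # then places its final list in a separate tag that we can parse out.
    checked = llm(
        "Check each decision against the transcript. Reason inside "
        "<scratchpad> tags, then put the corrected list inside <verified> tags.\n"
        + tagged("transcript", transcript)
        + "\n"
        + tagged("decisions", decisions)
    )
    verified = extract_tag("verified", checked)
    # Step 3: write the summary from the verified list only.
    return llm(
        "Write a short summary from these verified decisions.\n"
        + tagged("verified", verified)
    )
```

Because each step receives only what it needs, a failure can be traced to one call, and the scratchpad reasoning never leaks into the final summary prompt.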

RAG and grounding

Grounding: instruct the model explicitly to answer from the passages retrieved from the vector database first (or only), rather than from its own memory. If the retrieved data does not cover the question, it should say so rather than hallucinate. Even best-in-class providers report that hallucination rates are hard to push below about 2%.
Hybrid search: combine semantic search (vector embeddings for synonyms, vague queries, misspellings) with keyword search (faster, cheaper, more precise for identifiers and proper names). The choice is "and", not "or".
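
One way to make hybrid search an "and" is to run both searches and merge the two ranked lists. The sketch below uses reciprocal rank fusion, a merging method this summary does not name; it is swapped in here as an assumption, as are the constant k = 60, the document IDs, and the function names. The second function shows one possible grounded prompt with an explicit fallback instruction.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by both keyword and semantic search rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def grounded_prompt(question: str, passages: list[str]) -> str:
    """Build a prompt that restricts the model to the retrieved passages."""
    context = "\n".join(f"<passage>{p}</passage>" for p in passages)
    if not context:
        context = "<passage>NONE</passage>"
    return (
        "Answer using only the passages below. If they do not contain the "
        'answer, reply exactly: "I could not find that in the provided documents."\n'
        f"{context}\n<question>{question}</question>"
    )


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["doc_sku_4411", "doc_returns"]
vector_hits = ["doc_returns", "doc_refund_policy", "doc_sku_4411"]
merged = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Here doc_returns ends up first because it ranks near the top of both lists, while doc_refund_policy, found only by semantic search, ranks last.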

Evaluations

Manual human review is too costly at scale; use an LLM-as-judge pattern instead — run evals in a batch process after responses are delivered to users, then feed results back to improve the model.
Key practices for reliable evals:
- Compare, don't score: ask the evaluator to pick the better of two answers rather than assign a numeric score. Pairwise comparisons tend to be more reliable than absolute scores.
- Swap positions: run each comparison twice with A/B order reversed to neutralise position bias.
- Allow ties: if both answers are genuinely equal, the evaluator should say so.
- Chain of thought for the evaluator: requiring the judge to explain its reasoning improves accuracy and lets smaller, cheaper models match large-model quality.
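
The comparison practices above (compare, swap positions, allow ties) can be combined in one small function. This is a sketch under stated assumptions: judge is a hypothetical callable wrapping your LLM call, assumed to return "first", "second", or "tie" after the judge model has explained its reasoning; the labels "A" and "B" are illustrative.

```python
from typing import Callable

# Hypothetical judge: (question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def compare(question: str, answer_a: str, answer_b: str, judge: Judge) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    # Pass 1: A is shown first.
    forward = judge(question, answer_a, answer_b)
    # Pass 2: positions swapped, so B is shown first.
    backward = judge(question, answer_b, answer_a)

    # Translate position-based verdicts back to answer labels.
    forward_vote = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_vote = {"first": "B", "second": "A"}.get(backward, "tie")

    # Only trust a winner that survives the swap; anything else is a tie.
    return forward_vote if forward_vote == backward_vote else "tie"
```

A judge that always favours whichever answer it sees first produces contradictory votes across the two passes, so its position bias collapses to a tie instead of a false win.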

Moving prompts between models

Migrating a prompt from one model to another (including version upgrades) can degrade output quality without warning.
Short-term fix — pinning: lock to a specific model version so provider updates don't affect you.
Long-term fix — shadowing: run the candidate model in a parallel test environment that receives a copy of every live request. Monitor both outputs over time; only switch when confidence is established.
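
Pinning and shadowing can be sketched together. Everything named here is an assumption for illustration: PINNED_MODEL is a hypothetical version string, the two callables stand in for real model clients, and exact-match agreement is a deliberately crude comparison that a real setup would likely replace with its eval pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical pinned identifier: a dated version rather than a floating alias.
PINNED_MODEL = "example-model-2024-06-01"

Model = Callable[[str], str]


@dataclass
class ShadowRouter:
    live: Model        # the pinned model that users actually see
    candidate: Model   # the model under evaluation
    log: list = field(default_factory=list)

    def handle(self, request: str) -> str:
        live_output = self.live(request)
        candidate_output: Optional[str]
        try:
            # The candidate receives a copy of the same request.
            candidate_output = self.candidate(request)
        except Exception:
            # A failing candidate must never affect the live response.
            candidate_output = None
        self.log.append(
            {"request": request, "live": live_output, "candidate": candidate_output}
        )
        return live_output  # users only ever receive the live output

    def agreement_rate(self) -> float:
        """Fraction of logged requests where both models gave identical output."""
        if not self.log:
            return 0.0
        same = sum(1 for row in self.log if row["live"] == row["candidate"])
        return same / len(self.log)
```

In production the candidate call would normally run asynchronously so it adds no latency to the live path; it is kept synchronous here to keep the sketch short.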

Collecting user feedback

Feedback drives incremental model improvement; implicit signals are more abundant than explicit ones.
Implicit examples: accepting an autocomplete unchanged (strong positive), accepting then editing (mild positive), rejecting outright (negative); choosing one of two image variations in Midjourney; selecting one of two ChatGPT response options.
Explicit examples: thumbs up / thumbs down buttons. Useful but engagement is low — implicit signals should be the primary data source.
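
The autocomplete example maps naturally onto a small classifier. The numeric weights below are assumptions chosen only to preserve the ordering described above (strong positive, mild positive, negative); the names are hypothetical.

```python
from enum import Enum


class Signal(Enum):
    # Weights are illustrative assumptions, not values from the series.
    ACCEPTED_UNCHANGED = 1.0     # strong positive
    ACCEPTED_THEN_EDITED = 0.5   # mild positive
    REJECTED = -1.0              # negative


def classify(suggestion: str, accepted: bool, final_text: str) -> Signal:
    """Turn one autocomplete interaction into an implicit feedback signal."""
    if not accepted:
        return Signal.REJECTED
    if final_text == suggestion:
        return Signal.ACCEPTED_UNCHANGED
    return Signal.ACCEPTED_THEN_EDITED


def mean_score(signals: list[Signal]) -> float:
    """Average signal weight, usable as a rough quality metric over time."""
    if not signals:
        return 0.0
    return sum(s.value for s in signals) / len(signals)
```

Logging the suggestion alongside the user's final text also yields corrected examples that can later feed evals or training data.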

Downsizing and product-market fit

Start experiments with large commercial models to prove feasibility quickly.
Once product-market fit is confirmed, downsize to smaller commercial models or self-hosted open-source alternatives to reduce latency, cut costs, and improve data privacy.
Validate PMF before optimising infrastructure — premature downsizing wastes effort on features users may not want.

Building a strategic moat

Model-swapping is not a moat — the infrastructure built around models is. Four durable moat components:

Guardrails — infrastructure that prevents harmful, off-policy, or unsafe outputs.
Data flywheel — the feedback loop that continuously feeds user signals back into model improvement.
Caching — storing responses to common queries so repeat questions bypass the LLM entirely, saving time and money.
Evaluations — a robust, ongoing eval pipeline that compounds quality improvements over time.
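
Of the four components, caching is the simplest to illustrate. The sketch below is an exact-match cache over normalised queries, which is only one possible design; the class name and the llm callable are hypothetical, and matching semantically similar questions would need additional machinery not shown here.

```python
import hashlib
from typing import Callable


class ResponseCache:
    """Serve repeat questions from memory so they bypass the LLM entirely."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self._llm = llm
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Normalise case and whitespace so trivial variations share one entry.
        normalised = " ".join(query.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def ask(self, query: str) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._llm(query)  # only cache misses reach the model
        self._store[key] = response
        return response
```

The hit and miss counters give a direct measure of how much time and money the cache is saving.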
