The original is one click away. Open original ↗
State-of-the-art prompting and evals for AI agents
Executive overview
Prompting for production AI agents is more structured and systematic than most builders realise. The best vertical AI companies treat prompts as engineering artifacts — long, detailed, XML-tagged — and treat evals as the true competitive moat.
The prompt is replaceable; the evals are not.
Anatomy of a production agent prompt
- Role definition comes first: tell the model what it is and what it must decide.
- Break the task into numbered steps; use XML tags for structured output.
- Markdown-style headings help LLMs navigate long prompts — many were post-trained with XML-heavy input.
- Specify output format explicitly so the agent integrates cleanly with downstream agents.
- Give the model an escape hatch: instruct it to stop and ask rather than hallucinate when information is missing.
- One implementation: add a
debug_infooutput field where the model reports confusion or underspecified instructions — it becomes an automatic to-do list for the developer.
System, developer, and user prompt layers
- System prompt: company-wide, model-agnostic, no customer-specific logic.
- Developer prompt: customer-specific context stuffed in at runtime (e.g. how Perplexity handles refund queries vs. how Bolt does).
- User prompt: end-user input, only present when the product is consumer-facing.
- Forking and merging prompts across customers — deciding what is shared vs. customer-specific — is an open engineering problem.
Meta-prompting
- Meta-prompting: feeding a prompt back into an LLM and asking it to improve itself.
- Prompt folding: one prompt dynamically generates a better version of itself, often by ingesting examples where the previous version failed.
- Practical starting point: give the LLM the role of "expert prompt engineer", paste your prompt, and iterate on the output.
- Use a large model (e.g. o3, Gemini 2.5 Pro) to refine the prompt; deploy the refined prompt on a smaller, faster model — common pattern for latency-sensitive voice agents.
- Gemini 2.5 Pro's thinking traces are a key debugging tool: watch the reasoning trace on a single example to understand where the prompt misfires.
- Use Gemini as a REPL: drag and drop JSON files directly into gemini.google.com and observe reasoning live.
Worked examples and few-shot steering
- When prose instructions fail on complex tasks, inject a worked example directly into the prompt.
- Example: finding N+1 database queries requires showing the model an expert-annotated example, not a description.
- This is the LLM analogue of test-driven development — examples act as unit tests that steer reasoning.
- Automatically selecting the best examples from customer data and injecting them into the pipeline is a startup opportunity.
Model personalities and rubric handling
- Claude: more human-steerable, cooperative by default.
- Llama 4: responds better to heavy explicit steering; closer to working with a developer than an assistant.
- o3: rigidly follows rubrics, heavily penalises exceptions.
- Gemini 2.5 Pro: treats rubrics as guides, reasons through edge cases independently — useful when exceptions are common.
- Match model to task: rigid scoring → o3; nuanced judgment → Gemini 2.5 Pro.
Evals as the real moat
- ParaHelp open-sourced their prompt specifically because they consider evals — not prompts — to be the crown jewels.
- Without evals, you cannot understand why a prompt was written the way it was or systematically improve it.
- Building evals requires sitting next to the actual end user — the tractor sales manager in Nebraska, the FBI agent — and codifying their exact reward function.
- This domain knowledge cannot be acquired remotely or at scale; it is the defensible asset.
The forward-deployed engineer model
- Forward-deployed engineers (FDEs) originated at Palantir: send engineers — not salespeople — to sit inside client organisations and ship working software immediately.
- The insight: Fortune 500s and government agencies had trillion-dollar data problems but no one with software expertise in the room.
- FDEs win by showing a working demo on the second meeting instead of a 50-page proposal.
- Founders of vertical AI companies today are the FDEs of their own products — they take context from client meetings, encode it into the prompt, and return with a demo the next day.
- This compresses what Palantir did in weeks with a team of engineers into two founders closing seven-figure enterprise deals.
- Examples: GigaML (voice support, Zepto), Happy Robot (voice agents for logistics brokers) — both used the FDE model to move from six- to seven-figure contracts.
Continuous improvement (Kaizen applied to prompts)
- The people closest to the work are best placed to improve it — the same principle that produced Japan's automotive quality in the 1990s.
- Note failures in plain language as they occur; feed notes plus the original prompt to a large model and ask for suggested edits.
- Thinking traces surface the exact reasoning failures you cannot see from outputs alone.
More like this — when you're ready for early access.
Join the waitlist for a personal account and content recommendations based on what you're working on.
No spam. Unsubscribe at any time.
You're on the list. We'll be in touch before launch.