State-of-the-art prompting and evals for AI agents

Executive overview

Prompting for production AI agents is more structured and systematic than most builders realise. The best vertical AI companies treat prompts as engineering artifacts — long, detailed, XML-tagged — and treat evals as the true competitive moat.

The prompt is replaceable; the evals are not.

Anatomy of a production agent prompt

Role definition comes first: tell the model what it is and what it must decide.
Break the task into numbered steps; use XML tags for structured output.
Markdown-style headings help LLMs navigate long prompts — many were post-trained with XML-heavy input.
Specify output format explicitly so the agent integrates cleanly with downstream agents.
Give the model an escape hatch: instruct it to stop and ask rather than hallucinate when information is missing.
One implementation: add a debug_info output field where the model reports confusion or underspecified instructions — it becomes an automatic to-do list for the developer.

System, developer, and user prompt layers

System prompt: company-wide, model-agnostic, no customer-specific logic.
Developer prompt: customer-specific context stuffed in at runtime (e.g. how Perplexity handles refund queries vs. how Bolt does).
User prompt: end-user input, only present when the product is consumer-facing.
Forking and merging prompts across customers — deciding what is shared vs. customer-specific — is an open engineering problem.

Meta-prompting

Meta-prompting: feeding a prompt back into an LLM and asking it to improve itself.
Prompt folding: one prompt dynamically generates a better version of itself, often by ingesting examples where the previous version failed.
Practical starting point: give the LLM the role of "expert prompt engineer", paste your prompt, and iterate on the output.
Use a large model (e.g. o3, Gemini 2.5 Pro) to refine the prompt; deploy the refined prompt on a smaller, faster model — common pattern for latency-sensitive voice agents.
Gemini 2.5 Pro's thinking traces are a key debugging tool: watch the reasoning trace on a single example to understand where the prompt misfires.
Use Gemini as a REPL: drag and drop JSON files directly into gemini.google.com and observe reasoning live.

Worked examples and few-shot steering

When prose instructions fail on complex tasks, inject a worked example directly into the prompt.
Example: finding N+1 database queries requires showing the model an expert-annotated example, not a description.
This is the LLM analogue of test-driven development — examples act as unit tests that steer reasoning.
Automatically selecting the best examples from customer data and injecting them into the pipeline is a startup opportunity.

Model personalities and rubric handling

Claude: more human-steerable, cooperative by default.
Llama 4: responds better to heavy explicit steering; closer to working with a developer than an assistant.
o3: rigidly follows rubrics, heavily penalises exceptions.
Gemini 2.5 Pro: treats rubrics as guides, reasons through edge cases independently — useful when exceptions are common.
Match model to task: rigid scoring → o3; nuanced judgment → Gemini 2.5 Pro.

Evals as the real moat

ParaHelp open-sourced their prompt specifically because they consider evals — not prompts — to be the crown jewels.
Without evals, you cannot understand why a prompt was written the way it was or systematically improve it.
Building evals requires sitting next to the actual end user — the tractor sales manager in Nebraska, the FBI agent — and codifying their exact reward function.
This domain knowledge cannot be acquired remotely or at scale; it is the defensible asset.

The forward-deployed engineer model

Forward-deployed engineers (FDEs) originated at Palantir: send engineers — not salespeople — to sit inside client organisations and ship working software immediately.
The insight: Fortune 500s and government agencies had trillion-dollar data problems but no one with software expertise in the room.
FDEs win by showing a working demo on the second meeting instead of a 50-page proposal.
Founders of vertical AI companies today are the FDEs of their own products — they take context from client meetings, encode it into the prompt, and return with a demo the next day.
This compresses what Palantir did in weeks with a team of engineers into two founders closing seven-figure enterprise deals.
Examples: GigaML (voice support, Zepto), Happy Robot (voice agents for logistics brokers) — both used the FDE model to move from six- to seven-figure contracts.

Continuous improvement (Kaizen applied to prompts)

The people closest to the work are best placed to improve it — the same principle that produced Japan's automotive quality in the 1990s.
Note failures in plain language as they occur; feed notes plus the original prompt to a large model and ask for suggested edits.
Thinking traces surface the exact reasoning failures you cannot see from outputs alone.

State-of-the-art prompting and evals for AI agents

Executive overview

Anatomy of a production agent prompt

System, developer, and user prompt layers

Meta-prompting

Worked examples and few-shot steering

Model personalities and rubric handling

Evals as the real moat

The forward-deployed engineer model

Continuous improvement (Kaizen applied to prompts)

More like this — when you're ready for early access.

Get early access to the full library.

Be among the first to get personalised recommendations tailored to your stage in business.

Be among the first to get personalised recommendations tailored to your stage in business.

Executive overview

Anatomy of a production agent prompt

System, developer, and user prompt layers

Meta-prompting

Worked examples and few-shot steering

Model personalities and rubric handling

Evals as the real moat

The forward-deployed engineer model

Continuous improvement (Kaizen applied to prompts)

More like this — when you're ready for early access.

More in AI

Building $10,000 software MVPs with AI in under an hour

How to actually make money with AI: five brutal truths

How to choose the right home for your AI workflow

Get early access to the full library.

Be among the first to get personalised recommendations tailored to your stage in business.

Be among the first to get personalised recommendations tailored to your stage in business.