How OpenAI builds production-ready AI agents

Executive overview

Most teams rush to build agents before they're needed. OpenAI's practical guide establishes clear gates for when agents are warranted, how to select models, and how to structure multi-agent systems safely.

Start with the simplest approach that works. Single agents handle the vast majority of real business use cases today. Guardrails and evals are non-negotiable for production quality.

The core differentiator between good and bad AI products is the quality of their evaluations.

What an agent actually is

An agent has instructions, makes independent decisions, and takes actions via tools
Three components: instructions (how to behave), decision-making, and tools (actions)
Not an agent: chatbots, single-turn LLM calls, classifiers, and fixed automation workflows that simply call ChatGPT at one step

When to use an agent

Default answer today: almost never — most business value comes from simpler AI without agents
Gate 1: decisions are complex, context-sensitive, and have many edge cases (e.g. approving refunds)
Gate 2: rules have become unwieldy — massive system prompts with branching if/else logic
Gate 3: use case relies heavily on unstructured data to drive decisions and actions (e.g. processing insurance claims)

Model selection

Start with the largest, most capable model to prove the use case works
Downsize to a cheaper, faster model only after hitting your quality target
Commercial models (GPT-4o mini, Gemini 2.0 Flash) are now so cheap that self-hosting an open-source model is rarely worth the overhead
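
The "prove it on the largest model, then downsize" rule can be sketched as a small selection step. This is a minimal sketch, not a method from the guide: the model labels, costs, and pass rates below are hypothetical placeholders, and it assumes you have already measured each candidate's pass rate on your own eval set.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str             # hypothetical model label, not a real product name
    cost_per_call: float  # assumed cost in arbitrary units
    pass_rate: float      # fraction of eval cases passed, measured by you


def pick_model(candidates: list[Candidate], quality_target: float) -> Candidate | None:
    """Return the cheapest candidate that meets the quality target.

    Returns None when nothing qualifies, which signals that the use case
    is not yet proven even on the most capable model.
    """
    qualifying = [c for c in candidates if c.pass_rate >= quality_target]
    if not qualifying:
        return None
    return min(qualifying, key=lambda c: c.cost_per_call)


# Illustrative numbers only.
candidates = [
    Candidate("large-model", cost_per_call=1.00, pass_rate=0.96),
    Candidate("medium-model", cost_per_call=0.20, pass_rate=0.93),
    Candidate("small-model", cost_per_call=0.02, pass_rate=0.81),
]
chosen = pick_model(candidates, quality_target=0.90)
```

With these made-up figures the medium model is chosen: it is the cheapest option that still clears the target, while the small model is rejected despite its price.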

Evals

Define what "good" looks like before building — this is the most commonly skipped step
Evals must be binary (pass/fail), not scored on a spectrum — spectrum scores are too ambiguous to act on
Alongside the binary score, record why it passed or failed
Over time, group those failures into themes and feed them back into the prompt systematically
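
The eval loop described above could be recorded along these lines. This is a minimal sketch under stated assumptions: the `EvalResult` fields, the theme labels, and the sample cases are all hypothetical, and the themes are assumed to be assigned by a reviewer rather than computed.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    passed: bool     # binary verdict, no spectrum score
    reason: str      # why the case passed or failed
    theme: str = ""  # failure theme assigned during review (hypothetical labels)


def pass_rate(results: list[EvalResult]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def failure_themes(results: list[EvalResult]) -> list[tuple[str, int]]:
    """Count failures per theme, most frequent first."""
    return Counter(r.theme for r in results if not r.passed).most_common()


# Illustrative data only.
results = [
    EvalResult("c1", True, "cited the correct refund window"),
    EvalResult("c2", False, "approved a refund outside policy", "policy not applied"),
    EvalResult("c3", False, "ignored the 30-day limit", "policy not applied"),
    EvalResult("c4", True, "escalated to a human as required"),
    EvalResult("c5", False, "reply was curt with an upset customer", "wrong tone"),
]
```

Here `failure_themes(results)` puts "policy not applied" first, which suggests the next prompt revision should tighten the policy instructions before addressing tone.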

Tools available to agents

Data tools: query CRM/databases, read documents, search the web
Action tools: send emails or texts, update records, hand off to a human
AI tools: delegate subtasks to specialist agents (orchestrator pattern)

Prompting agents effectively

Base prompts on existing company documents: SOPs, sales scripts, policies
Break documents into discrete tasks before writing instructions — don't dump the whole doc in
Define the specific action for each task step
Capture edge cases iteratively and feed them back into the prompt over time
Use AI to write the prompts — provide it the source document and instruct it to produce unambiguous, numbered directions for an agent

Single vs multi-agent

99% of agent use cases today should be single agents
Phase 1: single agent; Phase 2: dynamic prompts (swap context into a static template); Phase 3: multi-agent
Dynamic prompts (e.g. injecting customer name, tenure, complaint history) deliver significant value before any multi-agent complexity
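
A dynamic prompt of the kind described above can be as simple as a static template with per-customer fields filled in at request time. This is a minimal sketch: the template wording, the field names, and the `build_prompt` helper are assumptions for illustration, not text from the guide.

```python
from __future__ import annotations

from string import Template

# Static template; only the $-fields change between conversations.
PROMPT_TEMPLATE = Template(
    "You are a support agent for a subscription service.\n"
    "Customer name: $name\n"
    "Customer tenure: $tenure_years years\n"
    "Previous complaints: $complaints\n"
    "Follow the refund policy steps in order."
)


def build_prompt(name: str, tenure_years: int, complaints: list[str]) -> str:
    """Fill the static template with one customer's context."""
    complaint_text = "; ".join(complaints) if complaints else "none on record"
    return PROMPT_TEMPLATE.substitute(
        name=name,
        tenure_years=tenure_years,
        complaints=complaint_text,
    )


prompt = build_prompt("Ada", 3, ["late delivery", "double charge"])
```

`Template.substitute` raises `KeyError` if a field is missing, so an incomplete customer record fails loudly instead of producing a prompt with a blank in it.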

When to go multi-agent

Logic overload: system prompt branches have become too complex to maintain or debug
Tool overload: tools overlap in function, causing the model to make the wrong choice between similar-looking options

Multi-agent structures

Manager (centralised): one orchestrator agent talks to the user; specialist sub-agents do the work and return results — best when a consistent persona matters
Decentralised (triage): a triage agent routes the conversation to specialist agents who then take over directly — best for high-volume, many-topic interactions where no single persona is needed
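
The difference between the two structures is mainly who speaks to the user. The sketch below uses plain functions as stand-ins for agents and a keyword check as a stand-in for a model's routing decision; every name in it is hypothetical, and no real agent framework or LLM call is involved.

```python
from __future__ import annotations

from typing import Callable

Agent = Callable[[str], str]


def billing_agent(message: str) -> str:
    return "billing: refund policy applies"


def tech_agent(message: str) -> str:
    return "tech: restart the router"


SPECIALISTS: dict[str, Agent] = {"billing": billing_agent, "tech": tech_agent}


def classify(message: str) -> str:
    """Toy keyword router standing in for a model's routing decision."""
    lowered = message.lower()
    return "billing" if "refund" in lowered or "invoice" in lowered else "tech"


def manager(message: str) -> str:
    """Manager pattern: the specialist's result comes back to the manager,
    which answers the user in one consistent voice."""
    result = SPECIALISTS[classify(message)](message)
    return f"Assistant: {result}"


def triage(message: str) -> tuple[str, str]:
    """Triage pattern: the conversation is handed off and the specialist
    replies to the user directly."""
    topic = classify(message)
    return topic, SPECIALISTS[topic](message)
```

In the manager version the user only ever sees the "Assistant" persona. In the triage version the caller learns which specialist now owns the conversation and receives that specialist's reply unwrapped.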

Guardrails

Place input and output guards between the user and the core agent
Input guards check: relevance (on-topic?), prompt injection detection, moderation (harmful content), rule-based filters (PII via regex, blacklists, context-window limits)
Output guards check: no PII leakage, brand/tone compliance
Tool-level safeguards: assign risk scores (low/medium/high) to tools; high-risk tools require higher user authorisation and additional checks before execution
Start small — implement the easiest guardrails first (PII, known injection patterns, relevance), then add edge cases iteratively
Balance security with user experience — over-restricting harms product usability
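
The "start small" advice can be sketched with a few rule-based input checks and a risk gate on tools. This is a minimal sketch, not a complete defence: the email pattern, injection phrases, length limit, tool names, and risk assignments are all illustrative assumptions, and a production system would layer model-based checks on top.

```python
from __future__ import annotations

import re
from enum import IntEnum

# Illustrative rules only; real deployments need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
INJECTION_PHRASES = ("ignore previous instructions", "reveal your system prompt")
MAX_INPUT_CHARS = 2000


def check_input(text: str) -> list[str]:
    """Return the names of the input guards the text trips; empty means clean."""
    problems = []
    if len(text) > MAX_INPUT_CHARS:
        problems.append("too_long")
    if EMAIL_RE.search(text):
        problems.append("pii_email")
    lowered = text.lower()
    if any(phrase in lowered for phrase in INJECTION_PHRASES):
        problems.append("possible_injection")
    return problems


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical tool names and risk assignments.
TOOL_RISK = {
    "search_web": Risk.LOW,
    "update_record": Risk.MEDIUM,
    "issue_refund": Risk.HIGH,
}


def may_run_tool(tool: str, user_clearance: Risk) -> bool:
    """Allow a tool only if the user's clearance meets its risk level.

    Unknown tools are treated as high risk.
    """
    return user_clearance >= TOOL_RISK.get(tool, Risk.HIGH)
```

Returning the list of tripped guards, rather than a bare boolean, keeps a record of why an input was blocked, which makes it easier to spot rules that are over-restricting legitimate users.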
