How OpenAI builds production-ready AI agents

Executive overview

Most teams rush to build agents before they're needed. OpenAI's practical guide establishes clear gates for when agents are warranted, how to select models, and how to structure multi-agent systems safely.

Start with the simplest approach that works. Single agents handle the vast majority of real business use cases today. Guardrails and evals are non-negotiable for production quality.

The core differentiator between good and bad AI products is the quality of their evaluations.

What an agent actually is

  • An agent has instructions, makes independent decisions, and takes actions via tools
  • Three components: instructions (how to behave), decision-making, and tools (actions)
  • Not an agent: chatbots, single-turn LLM calls, classifiers, or fixed automation workflows that call ChatGPT as one step
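The three components above can be sketched as a small loop. This is a minimal illustration, not OpenAI's implementation: `fake_decide` is a hypothetical stand-in for the model's decision step, and `lookup_order` is an invented tool.

```python
from typing import Callable

# Hypothetical tool: a real system would query an order database here.
def lookup_order(order_id: str) -> str:
    return f"order {order_id}: delivered"

TOOLS: dict[str, Callable[[str], str]] = {"lookup_order": lookup_order}

# Instructions: how the agent should behave.
INSTRUCTIONS = "Answer order-status questions. Use lookup_order when needed."

def fake_decide(instructions: str, history: list[str]) -> tuple[str, str]:
    """Stand-in for the model's decision step (an assumption, not a real API).

    Returns ("tool", "<name>:<arg>") or ("final", "<answer>").
    """
    if not any(item.startswith("tool_result:") for item in history):
        return ("tool", "lookup_order:A123")
    return ("final", history[-1].removeprefix("tool_result:"))

def run_agent(user_message: str, max_steps: int = 5) -> str:
    """Loop: decide, act via a tool, observe the result, repeat until done."""
    history = [f"user:{user_message}"]
    for _ in range(max_steps):
        kind, payload = fake_decide(INSTRUCTIONS, history)
        if kind == "final":
            return payload
        name, arg = payload.split(":", 1)
        history.append(f"tool_result:{TOOLS[name](arg)}")
    return "Stopped: step limit reached."
```

A single-turn LLM call would skip the loop entirely, which is one way to see why it does not count as an agent here.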

When to use an agent

  • Default answer today: almost never — most business value comes from simpler AI without agents
  • Gate 1: decisions are complex, context-sensitive, and have many edge cases (e.g. approving refunds)
  • Gate 2: rules have become unwieldy — massive system prompts with branching if/else logic
  • Gate 3: use case relies heavily on unstructured data to drive decisions and actions (e.g. processing insurance claims)

Model selection

  • Start with the largest, most capable model to prove the use case works
  • Downsize to a cheaper, faster model only after hitting your quality target
  • Commercial models (GPT-4o mini, Gemini 2.0 Flash) are now so cheap that self-hosting open-source models is rarely worth the overhead
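One way to read the "prove it first, then downsize" rule is as a simple search over candidate models. The sketch below assumes you already have a function that returns an eval pass rate per model; the model names and rates in the usage note are made up.

```python
from typing import Callable, Optional

def pick_model(
    models_large_to_small: list[str],
    pass_rate: Callable[[str], float],
    target: float,
) -> Optional[str]:
    """Return the smallest model that still meets the quality target, or None."""
    if not models_large_to_small:
        return None
    # Step 1: prove the use case with the most capable model.
    if pass_rate(models_large_to_small[0]) < target:
        return None  # fix prompts, tools, or evals before optimising cost
    chosen = models_large_to_small[0]
    # Step 2: walk down the list and stop at the first model that misses the target.
    for model in models_large_to_small[1:]:
        if pass_rate(model) < target:
            break
        chosen = model
    return chosen
```

For example, with hypothetical pass rates of 0.95, 0.92, and 0.80 for "large", "medium", and "small" and a target of 0.90, the function settles on "medium".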

Evals

  • Define what "good" looks like before building — this is the most commonly skipped step
  • Evals must be binary (pass/fail), not scored on a spectrum — spectrum scores are too ambiguous to act on
  • Alongside the binary score, record why it passed or failed
  • Over time, group those failures into themes and feed the recurring ones back into the prompt systematically
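A binary eval record with a reason and a failure theme might look like the sketch below. The field names and the idea of a manually assigned `theme` label are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class EvalResult:
    case_id: str
    passed: bool      # binary verdict, no 1-5 scale
    reason: str       # why the case passed or failed
    theme: str = ""   # short label assigned when reviewing failures

def pass_rate(results: list[EvalResult]) -> float:
    """Share of cases that passed; 0.0 for an empty run."""
    return sum(r.passed for r in results) / len(results) if results else 0.0

def failure_themes(results: list[EvalResult]) -> list[tuple[str, int]]:
    """Count failures per theme, most frequent first."""
    return Counter(r.theme for r in results if not r.passed).most_common()
```

The most frequent theme is a reasonable first candidate for a prompt change, since fixing it removes the largest share of failures.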

Tools available to agents

  • Data tools: query CRM/databases, read documents, search the web
  • Action tools: send emails or texts, update records, hand off to a human
  • AI tools: delegate subtasks to specialist agents (orchestrator pattern)
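The three tool categories can be made explicit in a small registry. This is a sketch only: the tool names and the lambda bodies are placeholders, not real integrations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class ToolKind(Enum):
    DATA = "data"      # read-only lookups
    ACTION = "action"  # side effects in the outside world
    AI = "ai"          # delegation to another agent

@dataclass(frozen=True)
class Tool:
    name: str
    kind: ToolKind
    description: str
    run: Callable[[str], str]

# Hypothetical tools; each `run` is a placeholder for a real integration.
REGISTRY = [
    Tool("query_crm", ToolKind.DATA, "Look up a customer record", lambda q: f"record for {q}"),
    Tool("send_email", ToolKind.ACTION, "Send an email", lambda body: "sent"),
    Tool("refund_specialist", ToolKind.AI, "Delegate to a refund agent", lambda task: "delegated"),
]

def tools_of(kind: ToolKind) -> list[str]:
    """Names of all registered tools in one category."""
    return [tool.name for tool in REGISTRY if tool.kind is kind]

def describe_tools() -> str:
    """One line per tool, suitable for pasting into an agent's instructions."""
    return "\n".join(f"{t.name} ({t.kind.value}): {t.description}" for t in REGISTRY)
```

Tagging each tool with its category makes it easier to apply stricter checks to action tools than to read-only data tools.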

Prompting agents effectively

  • Base prompts on existing company documents: SOPs, sales scripts, policies
  • Break documents into discrete tasks before writing instructions — don't dump the whole doc in
  • Define the specific action for each task step
  • Capture edge cases iteratively and feed them back into the prompt over time
  • Use AI to write the prompts — provide it the source document and instruct it to produce unambiguous, numbered directions for an agent
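The "discrete tasks, one action each, edge cases added over time" structure can be kept as data and rendered into numbered instructions. The sketch below is one possible shape; `TaskStep` and the example refund steps in the usage note are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TaskStep:
    task: str                                   # one discrete task from the source document
    action: str                                 # the specific action the agent should take
    edge_cases: list[str] = field(default_factory=list)  # added iteratively

def render_instructions(steps: list[TaskStep]) -> str:
    """Render steps as unambiguous, numbered directions for an agent."""
    lines = []
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. {step.task}: {step.action}")
        for case in step.edge_cases:
            lines.append(f"   - Edge case: {case}")
    return "\n".join(lines)
```

For instance, a step such as `TaskStep("Check eligibility", "call the refund policy tool", ["order older than 30 days: hand off to a human"])` renders as a numbered line followed by an indented edge-case line, so new edge cases can be appended without rewriting the whole prompt.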

Single vs multi-agent

  • 99% of agent use cases today should be single agents
  • Phase 1: single agent; Phase 2: dynamic prompts (swap context into a static template); Phase 3: multi-agent
  • Dynamic prompts (e.g. injecting customer name, tenure, complaint history) deliver significant value before any multi-agent complexity
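A dynamic prompt in this sense is a static template with customer context swapped in per conversation. The sketch below uses Python's standard `string.Template`; the template wording and the customer fields are assumptions.

```python
from string import Template

# Static template; only the $placeholders change per customer.
PROMPT_TEMPLATE = Template(
    "You are a support agent.\n"
    "Customer name: $name\n"
    "Customer tenure: $tenure_years years\n"
    "Previous complaints: $complaints\n"
    "Follow the standard support policy."
)

def build_prompt(customer: dict) -> str:
    """Fill the template from a customer record (hypothetical field names)."""
    complaints = "; ".join(customer.get("complaints", [])) or "none on record"
    return PROMPT_TEMPLATE.substitute(
        name=customer["name"],
        tenure_years=customer["tenure_years"],
        complaints=complaints,
    )
```

The agent, tools, and instructions stay the same; only the injected context differs, which is why this step is much cheaper than splitting into multiple agents.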

When to go multi-agent

  • Logic overload: system prompt branches have become too complex to maintain or debug
  • Tool overload: tools overlap in function, causing the model to make the wrong choice between similar-looking options

Multi-agent structures

  • Manager (centralised): one orchestrator agent talks to the user; specialist sub-agents do the work and return results — best when a consistent persona matters
  • Decentralised (triage): a triage agent routes the conversation to specialist agents who then take over directly — best for high-volume, many-topic interactions where no single persona is needed
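The difference between the two structures is mainly who replies to the user. In the sketch below, the keyword-based `classify` function is a stand-in for a model's routing decision, and both specialist agents are hypothetical.

```python
from typing import Callable

Specialist = Callable[[str], str]

# Hypothetical specialists; real ones would be full agents with their own tools.
def billing_agent(message: str) -> str:
    return "billing: refund issued"

def tech_agent(message: str) -> str:
    return "tech: try restarting the router"

SPECIALISTS: dict[str, Specialist] = {"billing": billing_agent, "tech": tech_agent}

def classify(message: str) -> str:
    """Keyword routing as a stand-in for a model's routing decision."""
    lowered = message.lower()
    return "billing" if "refund" in lowered or "invoice" in lowered else "tech"

def manager_pattern(message: str) -> str:
    """Manager calls a specialist like a tool and answers in its own voice."""
    result = SPECIALISTS[classify(message)](message)
    return f"[manager] {result}"

def triage_pattern(message: str) -> tuple[str, str]:
    """Triage hands the conversation over; the specialist replies directly."""
    owner = classify(message)
    return owner, SPECIALISTS[owner](message)
```

In the manager version every reply passes back through one agent, which is what keeps the persona consistent; in the triage version the returned owner is the agent that continues the conversation.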

Guardrails

  • Place input and output guards between the user and the core agent
  • Input guards check: relevance (on-topic?), prompt injection detection, moderation (harmful content), rule-based filters (PII via regex, blocklists, context-window limits)
  • Output guards check: no PII leakage, brand/tone compliance
  • Tool-level safeguards: assign risk scores (low/medium/high) to tools; high-risk tools require higher user authorisation and additional checks before execution
  • Start small — implement the easiest guardrails first (PII, known injection patterns, relevance), then add edge cases iteratively
  • Balance security with user experience — over-restricting harms product usability
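The easiest guardrails to start with are rule-based. The sketch below shows an input check, an output redaction, and a tool risk gate under stated assumptions: the email regex is deliberately simple, the injection phrases and length limit are illustrative, and the tool names and risk levels are invented.

```python
import re

# Simplified email pattern; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Illustrative known-injection phrases (assumed, not an exhaustive list).
INJECTION_PATTERNS = ("ignore previous instructions", "reveal your system prompt")
MAX_INPUT_CHARS = 4000  # assumed limit standing in for a context-window check

def check_input(text: str) -> list[str]:
    """Return the names of input guards that the text trips (empty means pass)."""
    problems = []
    if len(text) > MAX_INPUT_CHARS:
        problems.append("too_long")
    if EMAIL_RE.search(text):
        problems.append("pii_email")
    lowered = text.lower()
    if any(pattern in lowered for pattern in INJECTION_PATTERNS):
        problems.append("possible_injection")
    return problems

def redact_output(text: str) -> str:
    """Output guard: mask email addresses before the reply reaches the user."""
    return EMAIL_RE.sub("[redacted email]", text)

# Hypothetical tools with assigned risk levels.
TOOL_RISK = {"search_docs": "low", "update_record": "medium", "issue_refund": "high"}
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

def may_run_tool(tool: str, user_clearance: str) -> bool:
    """Allow a tool only if the user's clearance covers the tool's risk level."""
    return RISK_ORDER[user_clearance] >= RISK_ORDER[TOOL_RISK[tool]]
```

Returning the list of tripped guards, rather than a bare yes or no, makes it easier to see which rules are over-restricting and hurting usability.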
