MCP in production: why it fails and how to use it safely

Executive overview

MCP agents fail 20–50% of the time with today's best models. Most businesses cannot accept that failure rate for anything customer-facing or critical.

The case for MCP is real, but narrow. Use it only for adaptable, non-critical tasks where a human reviews the output. For everything else, use code — or wait six months.

MCP is not production-ready for most business use cases today; the gap is reliability, not potential.

The Friday afternoon test

  • Ask: would you deploy this MCP integration to production on a Friday afternoon?
  • If the answer is no, use code instead.
  • Three qualifying questions before proceeding:
    1. Can it fail without catastrophe?
    2. Is a human checking the output before it reaches customers?
    3. Am I solving something fuzzy that keeps changing?
  • All three must be yes. If not, wait.

Good and bad use cases

Good:

  • Drafting responses to varying customer complaints (human reviews before sending)
  • Deep research — pulling documents across changing requirements
  • Triaging unpredictable support questions before routing to a human

Bad:

  • Processing payments
  • Updating medical records
  • Executing trades
  • Anything where a wrong action causes irreversible business harm

Four rules for building with MCP in production

  1. Avoid the tool trap — start with as few tools as possible; each additional tool increases failure rate. Tools must be distinctly different, not merely similar. Enable dynamic tool discovery so the agent only loads what it needs at that moment.

  2. Talk to AI like a human — use Markdown instead of JSON blobs; replace raw error codes with plain-language explanations; be concise to avoid expensive runaway tool calls.

  3. Plan for chaos — cap the number of tool calls per run; log every call so failures are diagnosable; set timeouts so a stalled tool call doesn't block the workflow indefinitely.

  4. Lock down dangerous actions — require human approval for any irreversible action; use SSE or streamable HTTP (not stdio) for production transports; apply least-privilege data sharing; design for the worst-case scenario where the agent goes fully off-rails.
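
As a rough illustration of rules 3 and 4 (with a plain-language error message in the spirit of rule 2), the guardrails could sit in a thin wrapper around tool execution. This is a minimal sketch, not an MCP SDK API: `GuardedToolRunner`, `ToolBudgetExceeded`, the `approve` callback and the default limits are all hypothetical names and values chosen for the example, and tools are assumed to be plain Python callables.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Any, Callable, Collection, Dict

log = logging.getLogger("tool_calls")


class ToolBudgetExceeded(RuntimeError):
    """Raised when a run tries to exceed its tool-call cap."""


class GuardedToolRunner:
    """Hypothetical wrapper: call cap, per-call timeout, logging, approval gate."""

    def __init__(
        self,
        tools: Dict[str, Callable[..., Any]],
        max_calls: int = 10,            # assumed default, tune per workflow
        timeout_s: float = 5.0,         # assumed default, tune per tool
        irreversible: Collection[str] = (),
        approve: Callable[[str, dict], bool] = lambda name, args: False,
    ) -> None:
        self.tools = tools
        self.max_calls = max_calls
        self.timeout_s = timeout_s
        self.irreversible = set(irreversible)
        self.approve = approve          # stands in for a real human review step
        self.calls = 0

    def call(self, name: str, **args: Any) -> Any:
        # Rule 3: cap the number of tool calls per run.
        if self.calls >= self.max_calls:
            raise ToolBudgetExceeded(f"run exceeded {self.max_calls} tool calls")
        self.calls += 1

        if name not in self.tools:
            log.warning("tool=%s unknown", name)
            return f"There is no tool called '{name}'. Pick one of: {sorted(self.tools)}."

        # Rule 4: irreversible actions need explicit human approval.
        if name in self.irreversible and not self.approve(name, args):
            log.warning("tool=%s args=%s blocked: no approval", name, args)
            return "This action needs human approval and was not approved, so nothing was changed."

        start = time.monotonic()
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(self.tools[name], **args)
        try:
            # Rule 3: stop waiting after timeout_s. Note that this only stops
            # the wait; the worker thread itself keeps running until it returns.
            result = future.result(timeout=self.timeout_s)
        except FutureTimeout:
            log.error("tool=%s args=%s timed out after %.2fs", name, args, self.timeout_s)
            # Rule 2: plain-language explanation instead of a raw error code.
            return f"The tool '{name}' did not respond within {self.timeout_s} seconds. Try a different approach."
        finally:
            pool.shutdown(wait=False)

        # Rule 3: log every call so failures are diagnosable.
        log.info("tool=%s args=%s ok in %.3fs", name, args, time.monotonic() - start)
        return result
```

A run would create one `GuardedToolRunner` and send every tool call through `runner.call(...)`. The default `approve` callback denies everything, so an irreversible tool does nothing until a real review step is wired in.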

What needs to improve before MCP is broadly trustworthy

  • Multi-turn tool calling — the top-ranked model on long-context benchmarks fails 16% of the time; Claude 3.7 succeeds at only 16% on complex multi-tool airline booking tasks. Track tau-bench results over time.
  • Long-context memory — accuracy degrades sharply beyond ~192k tokens for most models; Claude 4 Opus drops to 36–37% at extended context lengths. Agents that chain many tool calls will hit these limits fast.
  • Reasoning — Humanity's Last Exam top score is ~21/100; stronger reasoning directly improves correct tool selection under complexity.

Where MCP is working today

  • Coding tools (Cursor, Windsurf) — code provides a structured environment with clear right/wrong answers, immediate error visibility, and a developer always in the loop.
  • Customer service triage — routing and classifying questions (not resolving them); one study showed 90% triage accuracy with MCP tools vs 60% without.
  • Deep research — aggregating information from multiple sources where requirements vary; the primary public proof point from OpenAI, Perplexity, Claude, and Grok.

How companies make it work

Hybrid approach — the workflow is not handed entirely to an agent:

  • ~80% routine tasks handled by deterministic code
  • ~15% variable, adaptive tasks routed through an AI agent with MCP tools
  • ~5% edge cases escalated to a human
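
The split above could be sketched as a small router, assuming incoming work has already been classified by kind. The ticket kinds, the `reset_password` handler and the `run_agent` callable are hypothetical placeholders. The point is only that deterministic code is tried first, the agent sees a narrow set of task kinds, and anything else (including an agent failure) falls back to a human.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Ticket:
    kind: str   # hypothetical label assigned upstream
    text: str


def reset_password(ticket: Ticket) -> str:
    """Stand-in for a routine task handled by deterministic code."""
    return "Password reset link sent."


# Routine kinds handled entirely by code (the ~80% path).
ROUTINE_HANDLERS = {"password_reset": reset_password}

# Variable, adaptive kinds the agent is allowed to handle (the ~15% path).
AGENT_KINDS = {"complaint", "product_question"}


def route(ticket: Ticket, run_agent: Callable[[Ticket], str]) -> Tuple[str, Optional[str]]:
    """Return (who handled it, result). A None result means a human takes over."""
    handler = ROUTINE_HANDLERS.get(ticket.kind)
    if handler is not None:
        return ("code", handler(ticket))

    if ticket.kind in AGENT_KINDS:
        try:
            return ("agent", run_agent(ticket))
        except Exception:
            # Fallback path: an agent failure escalates instead of retrying forever.
            return ("human", None)

    # Everything unrecognised is an edge case (the ~5% path).
    return ("human", None)
```

In this sketch the agent never sees a ticket that code could have handled, which keeps its scope narrow.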

Validation sandwich — wrap every AI step with code validation:

  1. Code validates the input format and data
  2. AI processes and extracts or generates content
  3. Code validates the output against expected structure and rules
  4. Human approves before any critical action is taken
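
The four steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: `model` stands in for whatever AI call is used and is assumed to return a JSON string, `approve` stands in for the human review step, and the field names and the category list are invented for the example.

```python
import json
from typing import Callable, Optional

# Hypothetical business rule: the only categories the output may use.
ALLOWED_CATEGORIES = {"billing", "shipping", "other"}


def validate_input(message: str) -> str:
    """Step 1: code validates the input format and data."""
    text = message.strip()
    if not text:
        raise ValueError("empty message")
    if len(text) > 2000:
        raise ValueError("message too long")
    return text


def validate_output(raw: str) -> dict:
    """Step 3: code validates the output against expected structure and rules."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("output is not a JSON object")
    if data.get("category") not in ALLOWED_CATEGORIES:
        raise ValueError("unknown category")
    reply = data.get("draft_reply")
    if not isinstance(reply, str) or not reply.strip():
        raise ValueError("missing draft reply")
    return data


def sandwich(
    message: str,
    model: Callable[[str], str],
    approve: Callable[[dict], bool],
) -> Optional[dict]:
    text = validate_input(message)   # step 1: code
    raw = model(text)                # step 2: AI
    draft = validate_output(raw)     # step 3: code
    if not approve(draft):           # step 4: human
        return None
    return draft
```

A `ValueError` from either validation step is the signal to take the fallback path rather than pass unchecked output downstream.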

Both patterns share three core properties: limited scope (a narrow subtask, not the whole workflow), a human in the loop for consequential decisions, and a fallback path if the agent fails.

Guidance by audience

  • Developers — experiment freely; build and test MCP servers to develop intuition.
  • Small to medium businesses — wait six months unless you have a use case that fits the qualifying criteria precisely (non-critical, human-reviewed, adaptive).
  • Large enterprises — start with minor, failure-tolerant use cases only; do not deploy MCP to core business processes yet.
