MCP in production: why it fails and how to use it safely

Executive overview

MCP agents fail 20–50% of the time with today's best models. Most businesses cannot accept that failure rate for anything customer-facing or critical.

The case for MCP is real, but narrow. Use it only for adaptable, non-critical tasks where a human reviews the output. For everything else, use code — or wait six months.

MCP is not production-ready for most business use cases today; the gap is reliability, not potential.

The Friday afternoon test

Ask: would you deploy this MCP to production on a Friday afternoon?
If the answer is no, use code instead.
Three qualifying questions before proceeding:
1. Can it fail without catastrophe?
2. Is a human checking the output before it reaches customers?
3. Am I solving something fuzzy that keeps changing?
All three must be yes. If not, wait.
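
The three qualifying questions can be written down as an explicit gate, which makes the decision easy to record and review. This is a minimal sketch: the `UseCase` class, its field names, and both helper functions are illustrative and not part of any MCP tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class UseCase:
    """Answers to the three qualifying questions (names are illustrative)."""
    name: str
    can_fail_without_catastrophe: bool
    human_checks_output: bool
    problem_is_fuzzy_and_changing: bool


def unmet_criteria(use_case: UseCase) -> List[str]:
    """Return the qualifying questions this use case answers 'no' to."""
    checks = {
        "Can it fail without catastrophe?": use_case.can_fail_without_catastrophe,
        "Is a human checking the output before it reaches customers?": use_case.human_checks_output,
        "Am I solving something fuzzy that keeps changing?": use_case.problem_is_fuzzy_and_changing,
    }
    return [question for question, answer in checks.items() if not answer]


def should_use_mcp(use_case: UseCase) -> bool:
    """All three answers must be yes; otherwise use code or wait."""
    return not unmet_criteria(use_case)
```

Returning the unmet questions, rather than a bare yes or no, shows which condition would have to change before the use case qualifies.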

Good and bad use cases

Good:

Drafting responses to varying customer complaints (human reviews before sending)
Deep research — pulling documents across changing requirements
Triaging unpredictable support questions before routing to a human

Bad:

Processing payments
Updating medical records
Executing trades
Anything where a wrong action causes irreversible business harm

Four rules for building MCPs in production

1. Avoid the tool trap: start with as few tools as possible, because each additional tool increases the failure rate. Tools must be clearly distinct from one another, not merely similar. Enable dynamic tool discovery so the agent loads only what it needs at that moment.
2. Talk to AI like a human: use Markdown instead of JSON blobs; replace raw error codes with plain-language explanations; be concise to avoid expensive runaway tool calls.
3. Plan for chaos: cap the number of tool calls per run; log every call so failures are diagnosable; set timeouts so a stalled tool call doesn't block the workflow indefinitely.
4. Lock down dangerous actions: require human approval for any irreversible action; use SSE or streamable HTTP (not stdio) for production transports; apply least-privilege data sharing; design for the worst-case scenario where the agent goes completely off the rails.
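
The "Plan for chaos" and "Lock down dangerous actions" rules can be enforced in one thin wrapper around every tool call. The sketch below is one possible shape, not an MCP SDK API: `GuardedToolRunner`, `ToolBudgetExceeded`, `deny_all`, and the default limits are all assumptions made for illustration. It caps calls per run, logs each call, stops waiting on stalled tools, blocks unapproved irreversible actions, and returns failures as plain-language messages.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Iterable

log = logging.getLogger("tool_guard")


class ToolBudgetExceeded(RuntimeError):
    """Raised when a run asks for more tool calls than its cap allows."""


def deny_all(tool_name: str, kwargs: dict) -> bool:
    """Default approval hook: refuse every irreversible action."""
    return False


class GuardedToolRunner:
    """Hypothetical wrapper for tool calls; not part of any MCP SDK."""

    def __init__(
        self,
        max_calls: int = 10,
        timeout_s: float = 5.0,
        irreversible: Iterable[str] = (),
        approve: Callable[[str, dict], bool] = deny_all,
    ) -> None:
        self.max_calls = max_calls
        self.timeout_s = timeout_s
        self.irreversible = frozenset(irreversible)
        self.approve = approve  # stand-in for a real human approval step
        self.calls_made = 0

    def call(self, name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        # Cap the number of tool calls per run (blocked calls count too).
        if self.calls_made >= self.max_calls:
            raise ToolBudgetExceeded(f"tool-call cap of {self.max_calls} reached")
        self.calls_made += 1

        # Log every call so failures are diagnosable.
        log.info("call %d/%d: %s %r", self.calls_made, self.max_calls, name, kwargs)

        # Irreversible actions need explicit approval before they run.
        if name in self.irreversible and not self.approve(name, kwargs):
            log.warning("blocked unapproved irreversible action: %s", name)
            return f"Blocked: '{name}' is irreversible and was not approved by a human."

        # Stop waiting on a stalled tool. The worker thread is not killed;
        # the timeout only unblocks the caller.
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            return pool.submit(fn, **kwargs).result(timeout=self.timeout_s)
        except FutureTimeout:
            log.error("%s timed out after %.2fs", name, self.timeout_s)
            # Plain-language message instead of a raw error code.
            return f"The tool '{name}' timed out after {self.timeout_s} seconds."
        except Exception as exc:
            log.exception("%s failed", name)
            return f"The tool '{name}' failed: {exc}"
        finally:
            pool.shutdown(wait=False)
```

A thread-based timeout cannot cancel work already in progress, so a tool with side effects may still complete after the caller has moved on. A process boundary or a client with its own request timeout would be stricter.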

What needs to improve before MCP is broadly trustworthy

Multi-turn tool calling: the top-ranked model on long-context benchmarks still fails 16% of the time, and Claude 3.7 succeeds only 16% of the time on complex multi-tool airline booking tasks. Track tau-bench results over time.
Long-context memory: accuracy degrades sharply beyond ~192k tokens for most models; Claude Opus 4 drops to 36–37% accuracy at extended context lengths. Agents that chain many tool calls will hit these limits fast.
Reasoning: the top score on Humanity's Last Exam is ~21/100; stronger reasoning directly improves correct tool selection under complexity.

Where MCP is working today

Coding tools (Cursor, Windsurf) — code provides a structured environment with clear right/wrong answers, immediate error visibility, and a developer always in the loop.
Customer service triage — routing and classifying questions (not resolving them); one study showed 90% triage accuracy with MCP tools vs 60% without.
Deep research — aggregating information from multiple sources where requirements vary; the primary public proof point from OpenAI, Perplexity, Claude, and Grok.

How companies make it work

Hybrid approach — the workflow is not handed entirely to an agent:

~80% routine tasks handled by deterministic code
~15% variable, adaptive tasks routed through an AI agent with MCP tools
~5% edge cases escalated to a human
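
The split above amounts to a router that sits in front of the agent. The following is a rough sketch under stated assumptions: the task-type names, the `Route` enum, and the `run_code`, `run_agent`, and `escalate` callables are placeholders, and a real system would decide what counts as "routine" from its own data.

```python
from enum import Enum
from typing import Callable


class Route(Enum):
    CODE = "deterministic code"
    AGENT = "AI agent with MCP tools"
    HUMAN = "human escalation"


# Hypothetical task types; substitute the categories of your own workflow.
ROUTINE = frozenset({"order_status", "password_reset"})
ADAPTIVE = frozenset({"complaint", "open_question"})


def choose_route(task_type: str) -> Route:
    """Routine work goes to code, adaptive work to the agent, the rest to a human."""
    if task_type in ROUTINE:
        return Route.CODE
    if task_type in ADAPTIVE:
        return Route.AGENT
    return Route.HUMAN


def handle(
    task_type: str,
    payload: str,
    run_code: Callable[[str], str],
    run_agent: Callable[[str], str],
    escalate: Callable[[str], str],
) -> str:
    """Dispatch a task, escalating to a human if the agent step fails."""
    route = choose_route(task_type)
    if route is Route.CODE:
        return run_code(payload)
    if route is Route.AGENT:
        try:
            return run_agent(payload)
        except Exception:
            # Fallback path: an agent failure never ends the workflow.
            return escalate(payload)
    return escalate(payload)
```

Unknown task types default to the human route, so anything the router does not recognise is treated as an edge case instead of being handed to the agent.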

Validation sandwich — wrap every AI step with code validation:

Code validates the input format and data
AI processes and extracts or generates content
Code validates the output against expected structure and rules
Human approves before any critical action is taken
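
The four layers above can be sketched as a single function. This is an illustration, not a prescribed implementation: it assumes a support-ticket workflow, a `model` callable that returns a JSON string, and an `approve` callable standing in for the human review step. The ticket fields and the category list are hypothetical.

```python
import json
from typing import Callable, Optional

# Hypothetical business rule: the only categories downstream code accepts.
ALLOWED_CATEGORIES = {"billing", "shipping", "other"}


def validate_input(ticket: dict) -> None:
    """Layer 1: code validates the input format and data."""
    if not isinstance(ticket.get("id"), int):
        raise ValueError("ticket id must be an integer")
    text = ticket.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("ticket text must be a non-empty string")


def validate_output(raw: str) -> dict:
    """Layer 3: code validates the model output against structure and rules."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    if data.get("category") not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {data.get('category')!r}")
    draft = data.get("draft_reply")
    if not isinstance(draft, str) or not draft.strip():
        raise ValueError("draft_reply must be a non-empty string")
    return data


def run_sandwich(
    ticket: dict,
    model: Callable[[str], str],
    approve: Callable[[dict], bool],
) -> Optional[dict]:
    """Run the four layers in order; return None if the human rejects the result."""
    validate_input(ticket)                 # 1. code checks the input
    raw = model(ticket["text"])            # 2. AI generates content
    result = validate_output(raw)          # 3. code checks the output
    return result if approve(result) else None  # 4. human approves
```

Both validators raise `ValueError`, so a caller can catch one exception type and send the ticket down a fallback path.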

Both patterns share three core properties: limited scope (a narrow subtask, not the whole workflow), a human in the loop for consequential decisions, and a fallback path if the agent fails.

Guidance by audience

Developers — experiment freely; build and test MCP servers to develop intuition.
Small to medium businesses — wait six months unless you have a use case that fits the qualifying criteria precisely (non-critical, human-reviewed, adaptive).
Large enterprises — start with minor, failure-tolerant use cases only; do not deploy MCP to core business processes yet.
