MCP in production: why it fails and how to use it safely
Executive overview
MCP agents fail 20–50% of the time with today's best models. Most businesses cannot accept that failure rate for anything customer-facing or critical.
The case for MCP is real, but narrow. Use it only for adaptable, non-critical tasks where a human reviews the output. For everything else, use code — or wait six months.
MCP is not production-ready for most business use cases today; the gap is reliability, not potential.
The Friday afternoon test
- Ask: would you deploy this MCP to production on a Friday afternoon?
- If the answer is no, use code instead.
- Three qualifying questions before proceeding:
  - Can it fail without catastrophe?
  - Is a human checking the output before it reaches customers?
  - Am I solving something fuzzy that keeps changing?
- All three must be yes. If not, wait.
Good and bad use cases
Good:
- Drafting responses to varying customer complaints (human reviews before sending)
- Deep research — pulling documents across changing requirements
- Triaging unpredictable support questions before routing to a human
Bad:
- Processing payments
- Updating medical records
- Executing trades
- Anything where a wrong action causes irreversible business harm
Four rules for building MCPs in production
- Avoid the tool trap: start with as few tools as possible; each additional tool increases failure rate. Tools must be distinctly different, not merely similar. Enable dynamic tool discovery so the agent only loads what it needs at that moment.
- Talk to AI like a human: use Markdown instead of JSON blobs; replace raw error codes with plain-language explanations; be concise to avoid expensive runaway tool calls.
- Plan for chaos: cap the number of tool calls per run; log every call so failures are diagnosable; set timeouts so a stalled tool call doesn't block the workflow indefinitely.
- Lock down dangerous actions: require human approval for any irreversible action; use SSE or streamable HTTP (not stdio) for production transports; apply least-privilege data sharing; design for the worst-case scenario where the agent goes fully off-rails.
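The "plan for chaos" and "lock down dangerous actions" rules can be sketched as a small wrapper that every tool call passes through. This is an illustrative sketch, not part of any MCP SDK: `GuardedToolRunner`, `ToolCallBlocked`, and the `approve` callback are hypothetical names, and the tools are plain Python callables standing in for real MCP tools.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

log = logging.getLogger("mcp.tool_calls")


class ToolCallBlocked(RuntimeError):
    """Raised when a guard (call cap, timeout, or approval gate) stops a call."""


class GuardedToolRunner:
    """Hypothetical wrapper; tools are plain callables standing in for MCP tools."""

    def __init__(self, tools, irreversible=(), max_calls=5, timeout_s=2.0, approve=None):
        self.tools = dict(tools)
        self.irreversible = set(irreversible)
        self.max_calls = max_calls
        self.timeout_s = timeout_s
        # Default approver denies everything, so irreversible tools stay
        # disabled until a real human-approval step is wired in.
        self.approve = approve or (lambda name, args: False)
        self.calls_made = 0

    def call(self, name, **args):
        # Cap: refuse once the per-run budget is spent.
        if self.calls_made >= self.max_calls:
            raise ToolCallBlocked(f"call cap of {self.max_calls} reached")
        self.calls_made += 1  # denied and timed-out attempts also count
        log.info("tool call %d/%d: %s %r", self.calls_made, self.max_calls, name, args)

        # Approval gate: irreversible tools need an explicit yes.
        if name in self.irreversible and not self.approve(name, args):
            log.warning("approval denied: %s %r", name, args)
            raise ToolCallBlocked(f"{name} needs human approval")

        # Timeout: stop waiting for a stalled tool after timeout_s seconds.
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(self.tools[name], **args)
        try:
            result = future.result(timeout=self.timeout_s)
        except FutureTimeout:
            log.error("timeout after %.1fs: %s", self.timeout_s, name)
            raise ToolCallBlocked(f"{name} timed out after {self.timeout_s}s") from None
        finally:
            # Do not block on the worker; note the thread itself is not killed.
            pool.shutdown(wait=False)
        log.info("tool result: %s -> %r", name, result)
        return result
```

To see the log lines, configure logging first, for example with `logging.basicConfig(level=logging.INFO)`. The timeout only stops the caller from waiting; a tool that must be truly cancelled needs to run in a separate process or support cancellation itself.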
What needs to improve before MCP is broadly trustworthy
- Multi-turn tool calling — the top-ranked model on long-context benchmarks fails 16% of the time; Claude 3.7 succeeds at only 16% on complex multi-tool airline booking tasks. Track tau-bench results over time.
- Long-context memory — accuracy degrades sharply beyond ~192k tokens for most models; Claude 4 Opus drops to 36–37% at extended context lengths. Agents that chain many tool calls will hit these limits fast.
- Reasoning — Humanity's Last Exam top score is ~21/100; stronger reasoning directly improves correct tool selection under complexity.
Where MCP is working today
- Coding tools (Cursor, Windsurf) — code provides a structured environment with clear right/wrong answers, immediate error visibility, and a developer always in the loop.
- Customer service triage — routing and classifying questions (not resolving them); one study showed 90% triage accuracy with MCP tools vs 60% without.
- Deep research: aggregating information from multiple sources where requirements vary; the primary public proof point, with deep-research products from OpenAI, Perplexity, Anthropic (Claude), and xAI (Grok).
How companies make it work
Hybrid approach — the workflow is not handed entirely to an agent:
- ~80% routine tasks handled by deterministic code
- ~15% variable, adaptive tasks routed through an AI agent with MCP tools
- ~5% edge cases escalated to a human
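As a rough sketch of the hybrid split above, a router can send known routine requests to deterministic code, a short allow-list of adaptive requests to the agent, and everything else to a human. The request categories, `route`, and `handle` are hypothetical names chosen for illustration; the 80/15/5 proportions come from how real traffic falls into these buckets, not from anything in the code.

```python
from enum import Enum


class Route(Enum):
    CODE = "deterministic code"
    AGENT = "AI agent with MCP tools"
    HUMAN = "human escalation"


# Hypothetical request categories; a real system would derive these
# from its own traffic.
ROUTINE = {"password_reset", "order_status", "invoice_copy"}
ADAPTIVE = {"complaint_draft", "research_request"}


def route(request_type: str) -> Route:
    """Pick a path: routine -> code, adaptive -> agent, anything else -> human."""
    if request_type in ROUTINE:
        return Route.CODE
    if request_type in ADAPTIVE:
        return Route.AGENT
    return Route.HUMAN  # unrecognized requests are treated as edge cases


def handle(request_type, payload, handlers, agent, escalate):
    """Dispatch one request; if the agent raises, fall back to a human."""
    target = route(request_type)
    if target is Route.CODE:
        return handlers[request_type](payload)
    if target is Route.AGENT:
        try:
            return agent(request_type, payload)
        except Exception:
            # Fallback path: an agent failure is escalated, never silently dropped.
            return escalate(request_type, payload)
    return escalate(request_type, payload)
```

The agent path is opt-in: a request type reaches the agent only if it is explicitly listed, so new or unknown request types default to a human rather than to the least predictable component.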
Validation sandwich — wrap every AI step with code validation:
- Code validates the input format and data
- AI processes and extracts or generates content
- Code validates the output against expected structure and rules
- Human approves before any critical action is taken
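The four steps above might look like the following for a support-ticket drafting task. This is a minimal sketch under stated assumptions: `model` is a hypothetical callable that takes the ticket text and returns a JSON string, `approve` is a hypothetical human-review callback, and the field names and category list are invented for the example.

```python
import json

# Hypothetical categories the output is allowed to use.
ALLOWED_CATEGORIES = {"billing", "shipping", "other"}


def validate_input(ticket: dict) -> dict:
    """Step 1: code checks the input format before the model sees it."""
    if not isinstance(ticket.get("id"), int):
        raise ValueError("ticket id must be an integer")
    text = ticket.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("ticket text must be a non-empty string")
    return {"id": ticket["id"], "text": text.strip()}


def validate_output(raw: str) -> dict:
    """Step 3: code checks the model output against structure and rules."""
    data = json.loads(raw)  # raises ValueError if the output is not JSON
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    if data.get("category") not in ALLOWED_CATEGORIES:
        raise ValueError("unknown category")
    draft = data.get("draft_reply")
    if not isinstance(draft, str) or not 1 <= len(draft) <= 1000:
        raise ValueError("draft_reply must be 1 to 1000 characters")
    return data


def run_sandwich(ticket: dict, model, approve):
    """Run validate -> model -> validate -> approve; None means not approved."""
    clean = validate_input(ticket)
    raw = model(clean["text"])      # Step 2: the only non-deterministic step
    result = validate_output(raw)
    if not approve(result):         # Step 4: human sign-off before any action
        return None
    return result
```

Validation failures raise `ValueError` rather than being repaired automatically, so the caller decides whether to retry, fall back to code, or escalate.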
Both patterns share three core properties: limited scope (a narrow subtask, not the whole workflow), a human in the loop for consequential decisions, and a fallback path if the agent fails.
Guidance by audience
- Developers — experiment freely; build and test MCP servers to develop intuition.
- Small to medium businesses — wait six months unless you have a use case that fits the qualifying criteria precisely (non-critical, human-reviewed, adaptive).
- Large enterprises — start with minor, failure-tolerant use cases only; do not deploy MCP to core business processes yet.