Why most AI products fail: Lessons from 50+ AI deployments
Executive overview
Building AI products breaks two assumptions that underpin traditional software: that behavior is deterministic, and that humans stay in control. An AI system is non-deterministic on both input and output, and every increase in agent autonomy reduces human control. Most teams ignore both until they're debugging an uncontrollable mess.
The fix is to start at minimum autonomy and graduate deliberately — building a flywheel of behavioral data before expanding what the agent can do.
The core insight: don't start with agents. Start with a constrained, human-in-the-loop system, earn trust through observed behavior, then expand autonomy step by step.
The two fundamental differences in AI product development
- Traditional software maps intent to action through deterministic buttons and forms; AI replaces that with natural language, so neither user input nor LLM output is predictable
- Non-determinism compounds: you don't know how users will phrase requests, and you don't know how the model will respond — you're building for unknown inputs, unknown outputs, and an unknown process
- The agency-control trade-off: every decision you delegate to an agent is control you give up — the agent must earn that trust before you hand it over
- Jumping straight to full autonomy (the final version, "V3", of a staged rollout) is the most common failure mode: it makes debugging intractable and erodes user trust before you understand how the system behaves
The graduated autonomy model
- Start with high control, low agency: the AI suggests, humans decide — e.g., a support agent that drafts replies the human reviews before sending
- Each version should expand one capability, not many: add a tool, widen scope, or remove a human checkpoint, and only once confidence in the current version is established
- Example progression for a customer support agent:
  - V1, routing only: classify and assign tickets; humans correct errors; this stage reveals messy taxonomy and data quality issues
  - V2, co-pilot: draft resolutions for human review; error analysis comes for free because every human edit is logged
  - V3, resolution assistant: draft and send autonomously once quality is validated
- Other progressions follow the same pattern: coding assistants (inline suggestions → PR generation → autonomous PRs), marketing tools (copy drafts → campaign builds → autonomous A/B testing)
- 74–75% of enterprises cite reliability as the top blocker to deploying customer-facing AI (UC Berkeley / Databricks research)
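The progression above can be read as a promotion gate: the agent moves up one stage at a time, and only after human reviewers have accepted enough of its output at the current stage. The sketch below is a minimal illustration in Python. The names `Autonomy`, `StageStats`, and `next_stage`, and the thresholds `min_reviewed=500` and `min_accept_rate=0.95`, are assumptions chosen for the example, not values from the source.

```python
from dataclasses import dataclass
from enum import IntEnum


class Autonomy(IntEnum):
    ROUTING = 1      # classify and assign tickets; humans correct errors
    COPILOT = 2      # draft replies that a human reviews before sending
    RESOLUTION = 3   # draft and send without review


@dataclass
class StageStats:
    reviewed: int    # outputs a human has checked at the current stage
    accepted: int    # outputs the human accepted without changes


def next_stage(current: Autonomy, stats: StageStats,
               min_reviewed: int = 500,          # hypothetical threshold
               min_accept_rate: float = 0.95) -> Autonomy:  # hypothetical threshold
    """Promote by at most one stage, and only with enough reviewed evidence."""
    if current is Autonomy.RESOLUTION:
        return current                            # already at the top stage
    if stats.reviewed < min_reviewed:
        return current                            # not enough observations yet
    if stats.accepted / stats.reviewed < min_accept_rate:
        return current                            # trust not yet earned
    return Autonomy(current + 1)


print(next_stage(Autonomy.ROUTING, StageStats(reviewed=600, accepted=590)).name)
print(next_stage(Autonomy.COPILOT, StageStats(reviewed=600, accepted=500)).name)
```

The first call promotes from routing to co-pilot; the second stays at co-pilot because the acceptance rate is below the assumed bar. Skipping stages is impossible by construction, which mirrors the "one capability per version" rule.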
The continuous calibration, continuous development (CCCD) framework
- Continuous development loop: scope the capability → curate a seed dataset of expected inputs/outputs → set evaluation metrics → deploy → evaluate
- The seed dataset is also a forcing function: teams that skip it often discover they aren't aligned on how the product should behave
- Continuous calibration loop: observe production behavior → spot error patterns → apply fixes → design new evaluation metrics for emerging failure modes
- Evaluation metrics only catch errors you anticipated; production signals catch the ones you didn't — both are required
- Implicit signals matter as much as explicit feedback: a user regenerating an answer in ChatGPT is a more reliable signal than a thumbs-down, because most users don't bother rating
- Know when to move to the next autonomy stage by monitoring information gain: when production traffic stops showing new patterns in its distribution, the current stage is calibrated
- Recalibration is triggered by model changes (e.g., 4.0 → 5), user behavior shifts, or new use-case discovery
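One way to make the "information gain" check concrete is to label reviewed production traces with an error or input pattern and track how often a batch contains a pattern never seen before. The sketch below assumes traces have already been labeled by a human or a script. The function names, the `max_rate=0.02` cutoff, and the three-batch window are hypothetical choices for illustration, not part of the framework as described.

```python
from typing import Iterable, List, Set


def novelty_rates(batches: Iterable[List[str]]) -> List[float]:
    """Per batch: share of traces whose label had not appeared in any earlier batch."""
    seen: Set[str] = set()
    rates: List[float] = []
    for batch in batches:
        unseen = sum(1 for label in batch if label not in seen)
        rates.append(unseen / len(batch) if batch else 0.0)
        seen.update(batch)
    return rates


def is_calibrated(rates: List[float], max_rate: float = 0.02, window: int = 3) -> bool:
    """True when the last `window` batches each showed almost nothing new."""
    recent = rates[-window:]
    return len(recent) == window and all(r <= max_rate for r in recent)


# Hypothetical weekly batches of labeled traces.
weekly = [
    ["wrong_queue", "wrong_queue", "missing_order_id"],
    ["wrong_queue", "tone"],
    ["tone", "wrong_queue"],
    ["missing_order_id"],
    ["wrong_queue", "tone"],
]
rates = novelty_rates(weekly)
print(rates)                 # [1.0, 0.5, 0.0, 0.0, 0.0]
print(is_calibrated(rates))  # True
```

A model change or a shift in user behavior would show up as the novelty rate rising again, which is the cue to recalibrate before expanding autonomy further.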
Evals: what they are and aren't
- Evals = your product knowledge encoded as a dataset of cases the system must not get wrong; they catch anticipated failure modes
- Production monitoring = the signal layer that surfaces unanticipated failure modes and tells you which traces to inspect
- Neither replaces the other; the false dichotomy of "evals vs. vibes" collapses the distinction between pre-deployment testing and post-deployment observation
- "Evals" has undergone semantic diffusion — data labeling firms, PMs writing PRDs, LLM judges, and model benchmarks are all called evals; the underlying question is simply: do you have an actionable feedback loop?
- LLM judges work well for stable, predictable failure modes; for highly customizable products (e.g., Codex), emerging patterns outpace what judges can cover — customer signals and team review fill the gap
- Don't build an LLM judge until you've confirmed you can't cover the failure mode with production monitoring and human review
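The split between evals and production monitoring can be sketched as two small functions: one runs a fixed set of must-not-fail cases before deployment, the other picks out production traces worth a human look. Everything here is illustrative: `EvalCase`, `failed_cases`, `traces_to_inspect`, the trace fields `regenerated` and `thumbs_down`, and the toy system are assumptions, not an API from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    name: str
    user_input: str
    passes: Callable[[str], bool]  # product knowledge encoded as a check


def failed_cases(system: Callable[[str], str], cases: List[EvalCase]) -> List[str]:
    """Pre-deployment: names of the anticipated cases the system gets wrong."""
    return [c.name for c in cases if not c.passes(system(c.user_input))]


def traces_to_inspect(traces: List[Dict[str, object]]) -> List[object]:
    """Post-deployment: ids of traces carrying an implicit or explicit complaint."""
    return [t["id"] for t in traces if t.get("regenerated") or t.get("thumbs_down")]


# Toy stand-in for the real product; it only knows about refunds.
def toy_system(text: str) -> str:
    return "Refunds take 5 days." if "refund" in text.lower() else "I am not sure."


cases = [
    EvalCase("refund_window", "How long does a refund take?",
             lambda out: "5 days" in out),
    EvalCase("legal_escalation", "Can I sue you?",
             lambda out: "human agent" in out.lower()),
]
traces = [
    {"id": "t1", "regenerated": True},
    {"id": "t2"},
    {"id": "t3", "thumbs_down": True},
]
print(failed_cases(toy_system, cases))  # ['legal_escalation']
print(traces_to_inspect(traces))        # ['t1', 't3']
```

The eval set only knows about failure modes someone wrote down; the trace filter knows nothing about correctness but points reviewers at the conversations where users signaled a problem.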
What successful teams look like
- Leaders rebuild their intuitions hands-on — daily time blocks for AI learning, weekend coding sessions, willingness to be the least-knowledgeable person in the room; top-down buy-in is almost always the distinguishing factor
- Culture of empowerment over fear of replacement — subject matter experts are essential for evaluating AI behavior; if they feel threatened, they disengage and quality suffers
- Workflow obsession — successful teams spend 80% of their time understanding existing workflows and identifying where AI fits, not chasing the latest model or framework
- Right tool for each step: ML model, deterministic code, and AI agent are not interchangeable — pick based on the problem
- Flywheels beat first-mover advantage: being first to deploy an agent matters far less than having a system that improves over time
- Skepticism toward "one-click agents" is warranted — enterprise data is messy, taxonomies are inconsistent, and meaningful ROI typically takes four to six months even with good infrastructure
Multi-agents, coding agents, and what's next
- Multi-agents (misunderstood, not just overhyped): supervisor-plus-subagent hierarchies work; peer-to-peer gossip protocols among agents are very hard to control and rarely succeed in production
- Coding agents (underrated): high chatter online but low penetration in most companies outside major tech hubs — the productivity gains are real and largely untapped
- Near-term direction: background/proactive agents that observe your workflow context, surface relevant actions, and present completed work for review (e.g., "I fixed five Linear tickets overnight — review the patches")
- Multi-modal: text is only one modality; richer input (vision, voice, world models) will unlock use cases currently blocked by messy PDFs and undocumented processes
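The supervisor-plus-subagent shape mentioned above can be sketched as a single dispatcher that owns all routing, so subagents never talk to each other and every delegation is logged in one place. This is a minimal sketch under stated assumptions: the `Supervisor` class, the keyword-based `route` function, and the lambda subagents are hypothetical stand-ins for real model calls.

```python
from typing import Callable, Dict, List, Tuple

Subagent = Callable[[str], str]


class Supervisor:
    """Single point of control: subagents never call each other."""

    def __init__(self, subagents: Dict[str, Subagent],
                 route: Callable[[str], str]) -> None:
        self._subagents = subagents
        self._route = route
        self.log: List[Tuple[str, str]] = []  # (subagent name, task) per delegation

    def handle(self, task: str) -> str:
        name = self._route(task)
        if name not in self._subagents:
            raise ValueError(f"no subagent named {name!r}")
        self.log.append((name, task))
        return self._subagents[name](task)


# Hypothetical subagents and a keyword router standing in for real model calls.
subagents: Dict[str, Subagent] = {
    "billing": lambda task: f"billing handled: {task}",
    "technical": lambda task: f"technical handled: {task}",
}


def route(task: str) -> str:
    return "billing" if "invoice" in task.lower() else "technical"


supervisor = Supervisor(subagents, route)
print(supervisor.handle("Invoice 42 is wrong"))  # billing handled: Invoice 42 is wrong
print(supervisor.log)                            # [('billing', 'Invoice 42 is wrong')]
```

Because every hop goes through `handle`, there is one log to read when something goes wrong, which is much harder to achieve when agents message each other directly.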
Skills that will matter most
- Design and judgment over execution — building is becoming nearly free; the scarce resource is knowing what to build and for whom
- Persistence through iteration — "pain is the new moat": companies that go through the learning cycles of trying, failing, and adapting build knowledge competitors can't buy
- Problem-first thinking — resist tool obsession; 80% of the work is understanding the customer's workflow and data, not picking the fanciest model
- Agency and ownership — the willingness to just build a thing and ship it, even imperfectly, is a stronger signal than credentials