Why AI agents still fail at multi-step tasks — and what to do about it

Executive overview

Most AI agent products ship without working reliably. Error rates compound across long workflows: at 90% accuracy per step, and assuming each step succeeds or fails independently, a 10-step task succeeds only about 35% of the time (0.9^10 ≈ 0.35). The industry has quietly normalised this.

Yutori's approach is to treat reliability as non-negotiable from day one, combining comprehensive evals, self-correction, and product craft to build agents users can trust.

Unreliability in agentic products is a choice — one the best builders should refuse to make.

Why agents break at scale

  • A 10-step workflow at 90% per-step accuracy has an overall success rate of about 35%, well below 50%
  • Errors compound fast: at that rate a 20-step workflow succeeds about 12% of the time, and a 50-step workflow under 1%
  • Most products paper over this with "it usually works if you try several times"
  • That normalisation of non-determinism is the core problem, not just a known limitation
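
The compounding arithmetic above can be checked in a few lines. This is a minimal sketch that assumes every step succeeds or fails independently with the same probability; real workflows may not satisfy that, and the function names are illustrative only.

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


def required_per_step_accuracy(target_success: float, steps: int) -> float:
    """Per-step accuracy needed to hit a target end-to-end success rate."""
    return target_success ** (1.0 / steps)


for n in (10, 20, 50):
    rate = workflow_success_rate(0.9, n)
    print(f"{n:>2} steps at 90% per step: {rate:.1%}")
# 10 steps at 90% per step: 34.9%
# 20 steps at 90% per step: 12.2%
# 50 steps at 90% per step: 0.5%

# Inverse view: what each step must achieve for a 95% end-to-end rate.
needed = required_per_step_accuracy(0.95, 10)
print(f"Per-step accuracy for 95% over 10 steps: {needed:.1%}")
# Per-step accuracy for 95% over 10 steps: 99.5%
```

Read the other way round, the same formula shows why small per-step gains matter: a 10-step workflow needs roughly 99.5% per-step accuracy to succeed 95% of the time.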

What reliability actually requires

  • Agents must recognise when they make a mistake and backtrack — not just push through
  • Every production query runs through a comprehensive eval suite to flag weak domains
  • There will always be new websites outside the training data, so robust error recovery matters more than memorised paths
  • Guardrails are built into the model training loop, not bolted on afterward
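
One way to picture "recognise a mistake and backtrack" is a step runner that verifies each result and returns to the last known-good state when a check fails. The sketch below is hypothetical and is not Yutori's implementation: `Step`, `run_with_backtracking`, and the toy steps are invented for illustration, and a real agent would replace the simple `check` functions with its own verification logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]


@dataclass
class Step:
    name: str
    action: Callable[[State], State]  # produces a candidate next state
    check: Callable[[State], bool]    # True if the candidate looks correct


def run_with_backtracking(
    steps: List[Step], state: State, max_retries: int = 2
) -> Tuple[State, List[str]]:
    """Run steps in order; on a failed check, discard the candidate state
    and retry from the last known-good checkpoint."""
    log: List[str] = []
    for step in steps:
        checkpoint = dict(state)  # shallow copy of the last known-good state
        for attempt in range(1, max_retries + 2):
            # Hand the action a copy so a bad attempt cannot corrupt the checkpoint.
            candidate = step.action(dict(checkpoint))
            if step.check(candidate):
                state = candidate
                log.append(f"{step.name}: ok (attempt {attempt})")
                break
            log.append(f"{step.name}: failed check, backtracking (attempt {attempt})")
        else:
            raise RuntimeError(f"{step.name} failed after {max_retries + 1} attempts")
    return state, log


# Toy usage: the second step produces a bad result on its first attempt.
calls = {"n": 0}


def flaky_fill(state: State) -> State:
    calls["n"] += 1
    state["form"] = "filled" if calls["n"] > 1 else "garbled"
    return state


steps = [
    Step("open_page", lambda s: {**s, "page": "loaded"},
         lambda s: s.get("page") == "loaded"),
    Step("fill_form", flaky_fill, lambda s: s.get("form") == "filled"),
]

final_state, log = run_with_backtracking(steps, {})
print(final_state)  # {'page': 'loaded', 'form': 'filled'}
for line in log:
    print(line)
```

The design choice worth noting is that the workflow fails loudly once retries are exhausted instead of pushing through with a bad state, which is the behaviour the first bullet asks for.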

Product craft as a differentiator

  • In a world where anyone can prototype fast with LLMs, taste and craft separate durable products
  • The team dog-foods new features for 90 minutes every week; tens of experiments run internally before any ships externally
  • Small, unasked-for features — like auto-filling 2FA codes — make users feel seen
  • User requests inform priorities, but intuition drives the features users didn't know to ask for

Transparency as trust

  • Users can inspect every Scout report to see which sites were visited and what the agent looked at
  • This "proof of work" visibility is directly descended from Grad-CAM: showing what the model attended to, not just the output
  • Attention to visible detail signals reliability in the invisible parts of the product
  • Trust is built incrementally; it cannot be declared

The longer arc

  • Digital agents will arrive before physical agents — the timeline is shorter
  • The future interface is a higher level of abstraction: tell an assistant what you want, not how to click through a site
  • The goal is humans and agents working together for productivity, not replacement
  • Accessibility is a real benefit: non-technical users no longer need to learn every new website
