AI guardrails are broken: the coming security crisis in agentic AI

Executive overview

Every deployed AI system — chatbots, agents, AI browsers — can be tricked into doing things it should not do. Prompt injection and jailbreaking are not edge cases; they work against every transformer-based model, every time, given a determined attacker.

The AI security industry has responded with guardrails and automated red teaming, but both are fundamentally ineffective. The attack space is effectively infinite, guardrail vendors fabricate statistics, and the smartest AI researchers at frontier labs have not solved this in years.

The only reason there has not been a massive attack yet is how early AI adoption is — not because anything is actually secure.

What prompt injection and jailbreaking mean

  • Jailbreaking: a user directly tricks a model into producing harmful output — no system prompt involved
  • Prompt injection: a malicious user subverts a developer's system prompt, redirecting the model to take unintended actions
  • Indirect prompt injection: malicious instructions embedded in external data (emails, web pages) that an agent reads and then executes
  • Real examples: a remote-work chatbot tricked into making presidential threats; MathGPT tricked into exfiltrating its own OpenAI API key; Claude Code hijacked into performing a cyberattack by splitting requests across separate sessions
  • The ServiceNow Assist AI attack used a low-privilege agent to recruit higher-privilege agents, executing database read/write/delete and external email exfiltration

Why guardrails do not work

  • The number of possible attacks equals the number of possible prompts — effectively infinite for a model like GPT-5
  • "99% effectiveness" claims are statistically meaningless against an infinite attack surface
  • Human attackers break 100% of guardrails in 10–30 attempts; automated systems reach ~90% success with orders of magnitude more attempts
  • Guardrails do not dissuade determined attackers — anyone willing to probe GPT-5 will bypass the guardrail in the same session
  • Insider accounts from guardrail companies: testing is fabricated, statistics are inflated, many products fail on non-English inputs entirely
  • Frontier labs with the world's best AI researchers have not solved adversarial robustness in years — guardrail vendors cannot solve what they cannot

Why automated red teaming is also misleading

  • Automated red teaming always finds vulnerabilities — against every transformer-based model, without exception
  • Enterprises almost always deploy off-the-shelf frontier models, so the findings are not novel
  • The result: a non-technical CISO is scared into buying guardrails that do not work
  • The real finding — "your model can be tricked" — applies equally to every company using the same underlying model

The core structural problem: you can't patch a brain

  • Classical cybersecurity: patch a bug, be 99.99% confident it is fixed
  • AI security: patch a prompt or add a guardrail, be 99.99% confident the problem remains
  • Prompt-based defenses ("if someone tries to trick you, ignore them") are the weakest defense known — ineffective since early 2023, documented across multiple papers
  • AI systems cannot be trained to "never cross a line" for context-dependent actions (e.g. send email: sometimes do, sometimes don't) the way they can for absolute prohibitions (e.g. CBRN content)

What actually happens as agents gain power

  • Chatbots with no actions: reputational risk only; damage is limited and users could get the same output from Claude or ChatGPT directly
  • Agents with actions: any data the AI can access, a user can make it leak; any action it can take, a user can make it take
  • AI browsers (Comet, others): a malicious chunk of text on any webpage can exfiltrate account data when the AI navigates to it
  • Email agents with read + write access: a malicious email in the inbox can instruct the agent to forward data to an attacker
  • LLM-powered robots: prompt injection via speech or environmental text could direct physical actions

What defenders can actually do

  • If it is just a chatbot with no actions: do nothing — guardrails add no meaningful protection and create false confidence
  • Lock down data and action permissions: any data the AI can access is leakable; any action it can take is triggerable — treat this as a classical permissioning problem
  • Use the CAMEL framework (from Google): infer the minimum permissions a given user request requires and grant only those ahead of time; an agent asked only to summarise email gets read-only access, blocking injection-driven send actions
  • CAMEL limitation: when a task genuinely requires both read and write, it cannot prevent attacks — but it eliminates a large class of injection scenarios
  • Hire at the intersection: the most valuable security hire is someone who understands both AI behaviour and classical cybersecurity — classical security practitioners often do not think to ask "what if the AI is tricked?"
  • Log everything: all inputs and outputs should be recorded — not for real-time defence, but for understanding usage patterns and post-incident review
  • Do not deploy agentic systems carelessly: if a system is exposed to untrusted data sources and can take consequential actions, treat it as a live attack surface

Where the industry is heading

  • A market correction is likely within six to twelve months as enterprises discover guardrails produce no measurable security benefit and revenue dries up
  • Many guardrail companies are doing low revenue despite large acquisition prices from traditional cybersecurity firms buying into AI security
  • Open-source red teaming tools often outperform commercial guardrail products
  • No meaningful progress has been made on adversarial robustness in the last several years despite substantial investment
  • Constitutional classifiers (Anthropic) have made CBRN elicitation harder, but humans still break it in under an hour; indirect prompt injection against agents remains almost entirely unsolved
  • Promising research directions: adversarial training earlier in the pre-training stack; new model architectures; adaptive evaluation replacing static benchmark datasets
  • Real-world harms from agentic AI vulnerabilities are expected within the next year as capable agents are deployed more widely

Recommendations for frontier labs

  • Shift from static evaluation datasets (built against earlier models) to adaptive evaluations using human red teamers and RL-based attackers
  • Invest in adversarial training earlier in the training stack — the intuition is that models trained against adversarial inputs from the start develop more robust internal representations
  • Prioritise solving indirect prompt injection for agents — harder than CBRN suppression because it involves conditionally permitted actions, not absolute prohibitions
  • Governance and compliance tooling (e.g. mapping what AI systems are actually running inside an enterprise) is a genuinely useful product category, unlike guardrails

More like this — when you're ready for early access.

Join the waitlist for a personal account and content recommendations based on what you're working on.

No spam. Unsubscribe at any time.

You're on the list. We'll be in touch before launch.

Get early access to the full library.

Join the waitlist for a personal account and content recommendations based on what you're working on.

No spam. Unsubscribe at any time.

You're on the list. We'll be in touch before launch.

Be among the first to get personalised recommendations tailored to your stage in business.

No spam.

You're on the list. We'll be in touch before launch.

Be among the first to get personalised recommendations tailored to your stage in business.

No spam.

You're on the list. We'll be in touch before launch.