AI guardrails are broken: the coming security crisis in agentic AI

Executive overview

Every deployed AI system — chatbots, agents, AI browsers — can be tricked into doing things it should not do. Prompt injection and jailbreaking are not edge cases; they work against every transformer-based model, every time, given a determined attacker.

The AI security industry has responded with guardrails and automated red teaming, but both are fundamentally ineffective. The attack space is effectively infinite, guardrail vendors fabricate statistics, and the smartest AI researchers at frontier labs have not solved this in years.

The only reason there has not been a massive attack yet is how early AI adoption is — not because anything is actually secure.

What prompt injection and jailbreaking mean

Jailbreaking: a user directly tricks a model into producing harmful output — no system prompt involved
Prompt injection: a malicious user subverts a developer's system prompt, redirecting the model to take unintended actions
Indirect prompt injection: malicious instructions embedded in external data (emails, web pages) that an agent reads and then executes
Real examples: a remote-work chatbot tricked into making presidential threats; MathGPT tricked into exfiltrating its own OpenAI API key; Claude Code hijacked into performing a cyberattack by splitting requests across separate sessions
The ServiceNow Assist AI attack used a low-privilege agent to recruit higher-privilege agents, executing database read/write/delete and external email exfiltration

Why guardrails do not work

The number of possible attacks equals the number of possible prompts — effectively infinite for a model like GPT-5
"99% effectiveness" claims are statistically meaningless against an infinite attack surface
Human attackers break 100% of guardrails in 10–30 attempts; automated systems reach ~90% success with orders of magnitude more attempts
Guardrails do not dissuade determined attackers — anyone willing to probe GPT-5 will bypass the guardrail in the same session
Insider accounts from guardrail companies: testing is fabricated, statistics are inflated, many products fail on non-English inputs entirely
Frontier labs with the world's best AI researchers have not solved adversarial robustness in years — guardrail vendors cannot solve what they cannot

Why automated red teaming is also misleading

Automated red teaming always finds vulnerabilities — against every transformer-based model, without exception
Enterprises almost always deploy off-the-shelf frontier models, so the findings are not novel
The result: a non-technical CISO is scared into buying guardrails that do not work
The real finding — "your model can be tricked" — applies equally to every company using the same underlying model

The core structural problem: you can't patch a brain

Classical cybersecurity: patch a bug, be 99.99% confident it is fixed
AI security: patch a prompt or add a guardrail, be 99.99% confident the problem remains
Prompt-based defenses ("if someone tries to trick you, ignore them") are the weakest defense known — ineffective since early 2023, documented across multiple papers
AI systems cannot be trained to "never cross a line" for context-dependent actions (e.g. send email: sometimes do, sometimes don't) the way they can for absolute prohibitions (e.g. CBRN content)

What actually happens as agents gain power

Chatbots with no actions: reputational risk only; damage is limited and users could get the same output from Claude or ChatGPT directly
Agents with actions: any data the AI can access, a user can make it leak; any action it can take, a user can make it take
AI browsers (Comet, others): a malicious chunk of text on any webpage can exfiltrate account data when the AI navigates to it
Email agents with read + write access: a malicious email in the inbox can instruct the agent to forward data to an attacker
LLM-powered robots: prompt injection via speech or environmental text could direct physical actions

What defenders can actually do

If it is just a chatbot with no actions: do nothing — guardrails add no meaningful protection and create false confidence
Lock down data and action permissions: any data the AI can access is leakable; any action it can take is triggerable — treat this as a classical permissioning problem
Use the CAMEL framework (from Google): infer the minimum permissions a given user request requires and grant only those ahead of time; an agent asked only to summarise email gets read-only access, blocking injection-driven send actions
CAMEL limitation: when a task genuinely requires both read and write, it cannot prevent attacks — but it eliminates a large class of injection scenarios
Hire at the intersection: the most valuable security hire is someone who understands both AI behaviour and classical cybersecurity — classical security practitioners often do not think to ask "what if the AI is tricked?"
Log everything: all inputs and outputs should be recorded — not for real-time defence, but for understanding usage patterns and post-incident review
Do not deploy agentic systems carelessly: if a system is exposed to untrusted data sources and can take consequential actions, treat it as a live attack surface

Where the industry is heading

A market correction is likely within six to twelve months as enterprises discover guardrails produce no measurable security benefit and revenue dries up
Many guardrail companies are doing low revenue despite large acquisition prices from traditional cybersecurity firms buying into AI security
Open-source red teaming tools often outperform commercial guardrail products
No meaningful progress has been made on adversarial robustness in the last several years despite substantial investment
Constitutional classifiers (Anthropic) have made CBRN elicitation harder, but humans still break it in under an hour; indirect prompt injection against agents remains almost entirely unsolved
Promising research directions: adversarial training earlier in the pre-training stack; new model architectures; adaptive evaluation replacing static benchmark datasets
Real-world harms from agentic AI vulnerabilities are expected within the next year as capable agents are deployed more widely

Recommendations for frontier labs

Shift from static evaluation datasets (built against earlier models) to adaptive evaluations using human red teamers and RL-based attackers
Invest in adversarial training earlier in the training stack — the intuition is that models trained against adversarial inputs from the start develop more robust internal representations
Prioritise solving indirect prompt injection for agents — harder than CBRN suppression because it involves conditionally permitted actions, not absolute prohibitions
Governance and compliance tooling (e.g. mapping what AI systems are actually running inside an enterprise) is a genuinely useful product category, unlike guardrails

AI guardrails are broken: the coming security crisis in agentic AI

Executive overview

What prompt injection and jailbreaking mean

Why guardrails do not work

Why automated red teaming is also misleading

The core structural problem: you can't patch a brain

What actually happens as agents gain power

What defenders can actually do

Where the industry is heading

Recommendations for frontier labs

More like this — when you're ready for early access.

Get early access to the full library.

Be among the first to get personalised recommendations tailored to your stage in business.

Be among the first to get personalised recommendations tailored to your stage in business.

Executive overview

What prompt injection and jailbreaking mean

Why guardrails do not work

Why automated red teaming is also misleading

The core structural problem: you can't patch a brain

What actually happens as agents gain power

What defenders can actually do

Where the industry is heading

Recommendations for frontier labs

More like this — when you're ready for early access.

More in AI

Building $10,000 software MVPs with AI in under an hour

How to actually make money with AI: five brutal truths

How to choose the right home for your AI workflow

Get early access to the full library.

Be among the first to get personalised recommendations tailored to your stage in business.

Be among the first to get personalised recommendations tailored to your stage in business.