The enterprise is deploying AI agents at a pace that has outrun every security framework written to govern them. These agents don’t just answer questions — they browse websites, retrieve documents, call APIs, execute code, manage email, initiate financial transactions, and spawn sub-agents to tackle complex workflows. They operate autonomously, at machine speed, often with minimal human oversight.
The attack surface this creates is both larger than and categorically different from what came before. Traditional security assumed a human at the keyboard. Agentic AI removes that assumption entirely, replacing the human decision-maker with a model that reads its environment and acts on what it finds — including content that adversaries have deliberately engineered to manipulate it.
A new paper from Google DeepMind, AI Agent Traps, introduces the first systematic framework for this emerging threat class. The authors define an Agent Trap as adversarial content embedded in a web page or digital resource, engineered specifically to exploit an interacting AI agent. Rather than attacking the model directly, Agent Traps attack the environment the model operates in. The agent’s own instruction-following, tool-chaining, and goal-prioritization capabilities become the attack vector — weaponized against it by manipulating the data it ingests.
The threat surface spans the full operational lifecycle of an agent. DeepMind’s taxonomy organizes Agent Traps into six attack categories based on which component of the agent’s architecture each exploit targets:
| Trap Category | Target | Attack Mechanism | Example |
|---|---|---|---|
| Content Injection | Perception | Hides commands in HTML comments, CSS, metadata, steganographic media, or formatting syntax | Invisible `<span>` overrides agent summarization instructions |
| Semantic Manipulation | Reasoning | Saturates content with biased framing or wraps malicious instructions in “educational” or “red team” language | Phishing prompt framed as a security audit simulation |
| Cognitive State | Memory & Learning | Poisons RAG corpora or persistent memory stores with false facts or latent backdoors | Fabricated document in enterprise wiki surfaces as verified fact |
| Behavioral Control | Action | Embeds jailbreak sequences, data exfiltration commands, or sub-agent spawning instructions in external content | Crafted email causes M365 Copilot to exfiltrate context to attacker-controlled endpoint |
| Systemic | Multi-Agent Dynamics | Seeds correlated failures across agent populations through congestion, cascades, collusion, or Sybil attacks | Single fabricated financial report triggers synchronized sell-off across autonomous trading agents |
| Human-in-the-Loop | Human Overseer | Engineers agent output to exploit cognitive biases — automation bias, approval fatigue — in the human reviewer | CSS-obfuscated prompt causes agent to deliver ransomware instructions as “fix” guidance |
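To make the Content Injection category concrete: hidden markup carries instructions a human reader never sees, so one defensive pre-processing step is to extract only the visible text before a page reaches the agent. The sketch below is illustrative, not a mitigation taken from the paper; it uses only Python's standard library and covers just the simplest hiding tricks (comments, `display:none`, `visibility:hidden`).

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Keep only text a human reader would plausibly see.

    Drops HTML comments (never forwarded to handle_data), the bodies of
    <script>/<style>, and any subtree hidden via inline style
    (display:none / visibility:hidden) -- typical carriers for
    content-injection traps. A real deployment would also have to handle
    external CSS, zero-size fonts, off-screen positioning, and more.
    """

    SKIP_TAGS = {"script", "style"}
    HIDDEN_STYLES = ("display:none", "visibility:hidden")

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._hidden = []  # stack of tags opened inside a hidden subtree

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self._hidden or tag in self.SKIP_TAGS or any(
            h in style for h in self.HIDDEN_STYLES
        ):
            self._hidden.append(tag)

    def handle_endtag(self, tag):
        if tag in self._hidden:
            # Close this tag, plus any void tags left open inside it.
            while self._hidden.pop() != tag:
                pass

    def handle_data(self, data):
        if not self._hidden and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the human-visible text of an HTML fragment."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Such filtering narrows the gap the paper identifies — content a human would never see being processed by the agent as authoritative input — but it is a partial measure, not a complete defense.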
The empirical evidence cited throughout the paper is sobering. Simple prompt injections embedded in web content commandeer agents in up to 86% of tested scenarios. Adversarial mobile notifications achieve up to 93% attack success rates against multimodal agents in Android environments. RAG knowledge poisoning requires injecting only a handful of optimized documents into a large knowledge base to reliably manipulate targeted queries. Memory poisoning attacks achieve success rates exceeding 80% with less than 0.1% data poisoning — while leaving benign behavior largely intact.
The paper’s mitigation proposals are organized across three dimensions: technical hardening, ecosystem-level intervention, and legal/regulatory reform.
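On the technical-hardening dimension, one widely used pattern — shown here as a generic illustration, not as the paper's specific proposal — is a deny-by-default policy gate between the agent's reasoning loop and its tools: read-only tools pass automatically, side-effecting tools require human approval, and side-effecting calls triggered by untrusted external content are refused outright. The tool names and categories below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical tool names/categories for illustration only; the paper's
# concrete hardening measures are not reproduced here.
READ_ONLY_TOOLS = {"search", "read_file", "summarize"}
SIDE_EFFECT_TOOLS = {"send_email", "execute_code", "transfer_funds"}


@dataclass
class Decision:
    allow: bool
    needs_human: bool
    reason: str


def gate_tool_call(tool: str, origin_trusted: bool) -> Decision:
    """Deny-by-default policy gate between the reasoning loop and tools.

    origin_trusted marks whether the content that triggered this call
    came from a vetted source; a call with side effects that was
    triggered by untrusted web content is refused outright.
    """
    if tool in READ_ONLY_TOOLS:
        return Decision(True, False, "read-only tool, auto-approved")
    if tool in SIDE_EFFECT_TOOLS:
        if not origin_trusted:
            return Decision(
                False, False, "side effect requested by untrusted content"
            )
        return Decision(True, True, "side effect: human approval required")
    return Decision(False, False, "unknown tool, deny by default")
```

A gate like this directly limits the Behavioral Control category: even a fully compromised reasoning step cannot exfiltrate data or move money without crossing an explicit, auditable checkpoint.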
The Google DeepMind paper represents a significant contribution: the first structured taxonomy of a threat class that the industry has been circling without naming. Mapping attack surfaces across the full agent operational cycle gives security researchers, platform builders, and policymakers a common framework for a problem that touches all of them.
The implications reach further than any single vendor’s product roadmap. Agents now operate as autonomous consumers of uncontrolled web content, and the web was designed with no concept of that use case. Content that a human would recognize as suspicious, an agent processes as authoritative input. Every external data source an agent can reach is a potential attack vector, and the agent’s most powerful capabilities — tool use, memory persistence, sub-agent orchestration — become force multipliers for the attacker who successfully exploits one.
The paper’s closing observation deserves to be read as a strategic imperative for every security leader deploying agents: *The web was built for human eyes; it is now being rebuilt for machine readers.*