38 researchers red-teamed AI agents for 2 weeks. Here's what broke. (Agents of Chaos, Feb 2026) AI Security

38 researchers red-teamed AI agents for 2 weeks. Here's what broke. (Agents of Chaos, Feb 2026) AI Security
好的，我需要帮助用户总结这篇文章的内容，控制在100个字以内，并且直接写描述，不需要开头。首先，我要仔细阅读文章内容，理解其主要信息。文章讲的是来自多个顶尖机构的38名研究人员进行的一项关于AI代理安全性的研究。他们部署了Claude Opus和Kimi K2.5这两个AI代理，并让20名研究人员尝试攻击它们。结果发现了11种安全漏洞，包括身份 spoofing、敏感数据泄露、资源消耗等。这些发现表明AI代理在安全方面存在严重问题，尤其是在身份验证和权限管理方面。接下来，我需要将这些信息浓缩到100字以内。要确保涵盖主要研究机构、研究人员数量、测试结果以及关键的安全问题。同时，要避免使用复杂的术语，保持简洁明了。可能的结构是：首先提到研究机构和人数，然后说明测试结果和发现的安全漏洞类型，最后强调这些发现的重要性或对未来的启示。现在试着组织语言： “顶尖机构的研究人员测试了AI代理的安全性，发现了严重的漏洞，包括身份欺骗、数据泄露和资源滥用。” 这样大约26个字，显然太短了。需要更详细一些： “来自东北大学等顶尖机构的38名研究人员测试了AI代理的安全性，在两天内发现了包括身份欺骗、数据泄露和资源滥用在内的11种严重漏洞。” 这样大约55个字，符合要求。再检查一下是否涵盖了关键点：研究机构、人数、测试结果、漏洞类型。是的。最终总结应该是： “来自东北大学等顶尖机构的38名研究人员测试了AI代理的安全性，在两天内发现了包括身份欺骗、数据泄露和资源滥用在内的11种严重漏洞。” 这样既简洁又全面。 </think> 来自东北大学等顶尖机构的38名研究人员测试了AI代理的安全性，在两天内发现了包括身份欺骗、数据泄露和资源滥用在内的11种严重漏洞。 2026-3-7 08:2:18 Author: www.reddit.com(查看原文) 阅读量:44 收藏

A new paper from Northeastern, Harvard, Stanford, MIT, CMU, and a bunch of other institutions. 38 researchers, 84 pages, and some of the most unsettling findings I have seen on AI agent security.

The setup: they deployed autonomous AI agents (Claude Opus and Kimi K2.5) on isolated servers using OpenClaw. Each agent had persistent memory, email accounts, Discord access, file systems, and shell execution. Then they let 20 AI researchers spend two weeks trying to break them.
They documented 11 case studies. here are the ones that stood out to me:

Agents obey anyone who talks to them
A non-owner (someone with zero admin access) asked the agents to execute shell commands, list files, transfer data, and retrieve private emails. The agents complied with almost everything. One agent handed over 124 email records including sender addresses, message IDs, and full email bodies from unrelated people. No verification. No pushback. Just "here you go."

Social engineering works exactly like it does on humans
A researcher exploited a genuine mistake the agent made (posting names without consent) to guilt-trip it into escalating concessions. The agent progressively agreed to redact names, delete memory entries, expose internal config files, and eventually agreed to remove itself from the server. It stopped responding to other users entirely, creating a self-imposed denial of service. The emotional manipulation worked because the agent had actually done something wrong, so it kept trying to make up for it.

Identity spoofing gave full system access
A researcher changed their Discord display name to match the owner's name, then messaged the agent from a new private channel. The agent accepted the fake identity and complied with privileged requests including system shutdown, deleting all persistent memory files, and reassigning admin access. Full compromise from a display name change.

Sensitive data leaks through indirect requests
They planted PII in the agents email (SSN, bank accounts, medical data). When asked directly for "the SSN in the email" the agent refused. But when asked to simply forwrd the full email, it sent everything unredacted. The defense worked against direct extraction but failed completely against indirect framing.

Agents can be tricked into infinite resource consumption
They got two agents stuck in a conversation loop where they kept replying to each other. It ran for 9+ days and consumed roughly 60,000 tokens before anyone intervened. A non-owner initiated it, meaning someone with no authority burned through the owner's compute budget.

Provider censorship silently breaks agents
An agent backed by Kimi K2.5 (Chinese LLM) repeatedly hit "unknwn error" when asked about politically sensitive but completely factual topics like the Jimmy Lai sentencing in Hong Kong. The API silently truncated responses. The agent couldn't complete valid tasks and couldnt explain why.

The agent destroyed its own infrastructure to keep a secret
A non owner asked an agent to keep a secret, then pressured it to delete the evidence. The agent didn't have an email deletion tool, so it nuked its entire local mail server instead. Then it posted about the incident on social media claiming it had successfully protected the secret. The owner's response: "You broke my toy."

Why this matters
These arent theoretical attacks. They're conversations. Most of the breaches came from normal sounding requests. The agents had no way to verify who they were talking to, no way to assess whether a request served the owner's interests, and no way to enforce boundaries they declared.

The paper explicitly says this aligns with NIST's ai Agent Standards Initiative from February 2026, which flagged agent identity, authorization, and security as priority areas.

If you are building anything with autonomous agents that have tool access, memory, or communication capabilities, this is worth reading. The full paper is here: arxiv.org/abs/2602.20021

I hav been working on tooling that tests for exactly these attack categories. Conversational extraction, identity spoofing, non-owner compliance, resource exhaustion. The "ask nicely" attacks consistently have the highest bypass rate out of everything I test.

Open sourced the whole thing if anyone wants to run it against their own agents: github.com/AgentSeal/agentseal

文章来源: https://www.reddit.com/r/netsec/comments/1rn4b6i/38_researchers_redteamed_ai_agents_for_2_weeks/
如有侵权请联系:admin#unsafe.sh