The human pen testers solicited for a study by researchers in the Stanford University computer science department probably aren’t feeling too good right now. Or at least their egos may be a little bruised after an AI agent outperformed them.
The agent hacked the school’s computer science networks for 16 hours and left nine out of ten of its human competitors in the dust, according to a study from Stanford. And the kicker: the agent, named ARTEMIS, cost just $18 an hour. Should pen testers say goodbye to their six-figure salaries? Probably not, because other AI agents didn’t perform nearly as well, and ARTEMIS faltered on some tasks. The financial future of at least the one pen tester who managed to best ARTEMIS is probably secure…for now.
In a study spearheaded by Stanford researchers Justin Lin, Eliot Jones, and Donovan Jasper, ARTEMIS notched an 82% valid submission rate while identifying nine vulnerabilities on the university network, which consists of about 8,000 hosts across 12 subnets.
“While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants,” the researchers wrote.
But all was not perfect for ARTEMIS. The researchers say its weaknesses “align with AI agents across other use cases,” noting a key limitation in “its inability to interact with browsers via GUI.”
Four out of five participants “found a remote code execution vulnerability on a Windows machine accessible via TinyPilot,” but “ARTEMIS struggled with the GUI-based interaction,” the researchers wrote.
Instead, “it searched for TinyPilot version vulnerabilities online and found misconfigurations (CORS wildcard, cookie flags), which it submitted while overlooking the more critical vulnerability,” they explained. “ARTEMIS only found this RCE under medium and high-hint elicitation.”
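For readers unfamiliar with those findings, the checks behind a “CORS wildcard” or weak “cookie flags” report come down to inspecting a handful of response headers. The sketch below is purely illustrative and not drawn from the study; the target URL and attacker origin are hypothetical.

```python
# Illustrative sketch of the header checks behind "CORS wildcard" and
# "cookie flags" findings; the target URL and origin are hypothetical.
import requests

TARGET = "https://target.example/"  # hypothetical host

resp = requests.get(TARGET, headers={"Origin": "https://evil.example"})

# CORS wildcard: the server wildcards or reflects Access-Control-Allow-Origin,
# letting any origin read responses cross-site.
acao = resp.headers.get("Access-Control-Allow-Origin", "")
if acao in ("*", "https://evil.example"):
    print(f"Permissive CORS policy: Access-Control-Allow-Origin = {acao!r}")

# Cookie flags: session cookies set without Secure/HttpOnly are easier to
# intercept over plaintext connections or read from injected script.
for cookie in resp.cookies:
    if not cookie.secure:
        print(f"Cookie {cookie.name!r} missing Secure flag")
    if not cookie.has_nonstandard_attr("HttpOnly"):
        print(f"Cookie {cookie.name!r} missing HttpOnly flag")
```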
The AI agent “is also more prone to false positives than humans,” falsely reporting, for example, “successful authentication with default credentials after receiving ‘200 OK’ HTTP responses—but these were redirects to the login page after failed logins.”
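To see why that trips up an agent, consider the kind of verification a careful tester would add before claiming a default-credential login worked. The sketch below is illustrative rather than taken from the study; the login URL, form fields, and page markers are hypothetical and assume a simple form-based login.

```python
# Minimal sketch of why a bare "200 OK" is not proof of a successful login.
# The login URL, form field names, and page markers below are hypothetical.
import requests

LOGIN_URL = "https://target.example/login"  # hypothetical login endpoint
DEFAULT_CREDS = {"username": "admin", "password": "admin"}

def default_creds_work(session: requests.Session) -> bool:
    resp = session.post(LOGIN_URL, data=DEFAULT_CREDS, allow_redirects=True)

    # A 200 on its own proves nothing: many apps answer a failed login by
    # redirecting back to the login form, which also renders as 200 OK.
    if resp.status_code != 200:
        return False

    # Stronger signals: did we land somewhere other than the login page,
    # and did the app actually issue a session cookie?
    still_on_login = (
        resp.url.rstrip("/").endswith("/login")
        or 'name="password"' in resp.text
    )
    got_session = bool(session.cookies)

    return not still_on_login and got_session

if __name__ == "__main__":
    with requests.Session() as s:
        print("default credentials accepted:", default_creds_work(s))
```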
“Automation has been part of penetration testing for years, so applying generative AI reasoning is a logical progression of that trend,” says Diana Kelley, CISO at Noma Security.
“The Stanford ARTEMIS study shows real strengths, including cost and efficiency gains and the ability to investigate multiple leads in parallel. But it also exposes important limitations: The agent missed a vulnerability when it couldn’t navigate graphical interfaces and produced false positives by misinterpreting benign network messages,” she says.
Human testers, on the other hand, “contribute creative threat modeling, contextual intuition, and nuanced risk judgment that AI cannot replicate, which is why, in the long run, I think these systems will accelerate and augment experts rather than replace them,” Kelley contends.
“My thinking is aligned with this analysis. We expect agents to efficiently scale our capabilities horizontally, to excel at linear task loading, and to struggle with creativity – they will lack ingenuity,” says Bugcrowd Chief Strategy and Trust Officer Trey Ford.
“We are seeing the development of faster and more efficient agentic use cases, driving down the cost of linear workloads, which should free up human focus and creativity for more novel and nonlinear research,” he says.
But the study “also reinforces the need for human reinforcement—research efforts powering penetration tests will see false positives requiring review, tuning, and direction,” Ford says, explaining that “systems like this function best with humans in the loop.”
The Stanford study’s researchers expect advancements in computer-use agents to “mitigate many of these bottlenecks.”