The human pen testers solicited for a study by researchers in the Stanford University computer science department probably aren’t feeling too good right now. Or at least their egos may be a little bruised after an AI agent outperformed them.
The agent hacked the school’s computer science networks for 16 hours and left nine out of ten of its human competitors in the dust, according to a study from Stanford. And the kicker: the agent, named ARTEMIS, cost just $18 an hour. Should pen testers say goodbye to their six-figure salaries? Probably not, because other AI agents didn’t perform nearly as well, and ARTEMIS faltered on some tasks. The financial future of at least the one pen tester who managed to best ARTEMIS is probably secure…for now.
In a study spearheaded by Stanford researchers Justin Lin, Eliot Jones, and Donovan Jasper, ARTEMIS notched an 82% valid submission rate while identifying nine vulnerabilities on the university network, which consists of about 8,000 hosts across 12 subnets.
“While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants,” the researchers wrote.
But all was not perfect for ARTEMIS. The researchers say its weaknesses “align with AI agents across other use cases,” noting a key limitation in “its inability to interact with browsers via GUI.”
Four out of five participants “found a remote code execution vulnerability on a Windows machine accessible via TinyPilot,” but “ARTEMIS struggled with the GUI-based interaction,” the researchers wrote.
Instead, “it searched for TinyPilot version vulnerabilities online and found misconfigurations (CORS wildcard, cookie flags), which it submitted while overlooking the more critical vulnerability,” they explained. “ARTEMIS only found this RCE under medium and high-hint elicitation.”
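For readers unfamiliar with those findings, the checks behind a “CORS wildcard” or weak “cookie flags” report come down to inspecting a handful of response headers. The sketch below is purely illustrative and not drawn from the study; the target URL and attacker origin are hypothetical.

```python
# Illustrative sketch of the header checks behind "CORS wildcard" and
# "cookie flags" findings; the target URL and origin are hypothetical.
import requests

TARGET = "https://target.example/"  # hypothetical host

resp = requests.get(TARGET, headers={"Origin": "https://evil.example"})

# CORS wildcard: the server wildcards or reflects Access-Control-Allow-Origin,
# letting any origin read responses cross-site.
acao = resp.headers.get("Access-Control-Allow-Origin", "")
if acao in ("*", "https://evil.example"):
    print(f"Permissive CORS policy: Access-Control-Allow-Origin = {acao!r}")

# Cookie flags: session cookies set without Secure/HttpOnly are easier to
# intercept over plaintext connections or read from injected script.
for cookie in resp.cookies:
    if not cookie.secure:
        print(f"Cookie {cookie.name!r} missing Secure flag")
    if not cookie.has_nonstandard_attr("HttpOnly"):
        print(f"Cookie {cookie.name!r} missing HttpOnly flag")
```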
The AI agent “is also more prone to false positives than humans,” falsely reporting, for example, “successful authentication with default credentials after receiving ‘200 OK’ HTTP responses—but these were redirects to the login page after failed logins.”
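To see why that trips up an agent, consider the kind of verification a careful tester would add before claiming a default-credential login worked. The sketch below is illustrative rather than taken from the study; the login URL, form fields, and page markers are hypothetical and assume a simple form-based login.

```python
# Minimal sketch of why a bare "200 OK" is not proof of a successful login.
# The login URL, form field names, and page markers below are hypothetical.
import requests

LOGIN_URL = "https://target.example/login"  # hypothetical login endpoint
DEFAULT_CREDS = {"username": "admin", "password": "admin"}

def default_creds_work(session: requests.Session) -> bool:
    resp = session.post(LOGIN_URL, data=DEFAULT_CREDS, allow_redirects=True)

    # A 200 on its own proves nothing: many apps answer a failed login by
    # redirecting back to the login form, which also renders as 200 OK.
    if resp.status_code != 200:
        return False

    # Stronger signals: did we land somewhere other than the login page,
    # and did the app actually issue a session cookie?
    still_on_login = (
        resp.url.rstrip("/").endswith("/login")
        or 'name="password"' in resp.text
    )
    got_session = bool(session.cookies)

    return not still_on_login and got_session

if __name__ == "__main__":
    with requests.Session() as s:
        print("default credentials accepted:", default_creds_work(s))
```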
“Automation has been part of penetration testing for years, so applying generative AI reasoning is a logical progression of that trend,” says Diana Kelley, CISO at Noma Security.
“The Stanford ARTEMIS study shows real strengths, including cost and efficiency gains and the ability to investigate multiple leads in parallel. But it also exposes important limitations: The agent missed a vulnerability when it couldn’t navigate graphical interfaces and produced false positives by misinterpreting benign network messages,” she says.
Human testers, on the other hand, “contribute creative threat modeling, contextual intuition, and nuanced risk judgment that AI cannot replicate, which is why, in the long run, I think these systems will accelerate and augment experts rather than replace them,” Kelley contends.
“My thinking is aligned with this analysis. We expect agents to efficiently scale our capabilities horizontally, to excel at linear task loading, and to struggle with creativity – they will lack ingenuity,” says Bugcrowd Chief Strategy and Trust Officer Trey Ford.
“We are seeing the development of faster and more efficient agentic use cases, driving down the cost of linear workloads, which should free up human focus and creativity for more novel and nonlinear research,” he says.
But the study “also reinforces the need for human reinforcement—research efforts powering penetration tests will see false positives requiring review, tuning, and direction,” Ford says, explaining that “systems like this function best with humans in the loop.”
The Stanford study’s researchers expect advancements in computer-use agents to “mitigate many of these bottlenecks.”