Sifting through data. After gaining initial access, it’s where attackers spend the bulk of their time. They trawl through files and network data to discover credentials, looking for opportunities for lateral movement and privilege escalation. Simultaneously, they hunt for high-value data like financial information, PCI/PII, or intellectual property to maximize impact and raise the stakes of the compromise.
Defending against this is difficult. EDR solutions rarely flag on an attacker simply reading files, and traditional data security tools fall short. These tools, often based on regular expressions and keyword matching, excel with highly structured data and finding easily identifiable data types like AWS access keys or credit card numbers. But they don’t fare well when faced with the massive volume of unstructured data found in file shares, cloud storage, and collaboration tools.
This is exactly where Large Language Models (LLMs) shine. LLMs move beyond rigid searches to provide true semantic understanding, allowing them to analyze data in context. This unlocks the ability to find hard-to-find credentials and assess the business risk of compromised data in a human-like way that was previously impossible.
In this post, we’ll explore how NodeZero’s new Advanced Data Pilfering (ADP) feature combines LLMs with offensive security techniques to supercharge defenders’ understanding of data at risk.
Advanced Data Pilfering (ADP) covers two common attacker behaviors:
Underpinning both capabilities is a smart data pipeline. It’s not feasible from a cost, performance, or efficiency perspective to send all data in a network to an LLM. NodeZero solves this by using a multi-stage approach to progressively filter and assess data. It combines traditional machine learning (to score files by metadata), embedding models (to understand semantic content), and well-tuned regular expressions (for highly structured data) to ensure that only the most relevant or ambiguous data is sent to the LLM for analysis.
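To make the staged filtering concrete, here is a minimal sketch of how a cheap-to-expensive triage could be ordered. Everything in it is an assumption for illustration: the `metadata_score` heuristic stands in for a trained model, `toy_embed` is a word-count stand-in for a real embedding model, and the thresholds and stage order are invented. It is not NodeZero's implementation.

```python
import math
import re
from dataclasses import dataclass

# Hypothetical regex for one highly structured secret type (AWS access key IDs).
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")

# Toy vocabulary for the stand-in embedder; a real pipeline would use a trained model.
VOCAB = ("password", "login", "credential", "secret")
QUERY_VEC = [1.0] * len(VOCAB)  # "looks like credential material"


@dataclass
class FileRecord:
    path: str
    size_bytes: int
    text: str


def metadata_score(rec: FileRecord) -> float:
    """Stage 1 stand-in: score a file from its path and size only."""
    score = 0.0
    lowered = rec.path.lower()
    if lowered.endswith((".ps1", ".txt", ".cfg", ".env", ".xml")):
        score += 0.5
    if any(word in lowered for word in ("backup", "access", "cred", "pass")):
        score += 0.4
    if rec.size_bytes > 5_000_000:  # very large files are expensive to read
        score -= 0.3
    return score


def toy_embed(text: str) -> list:
    """Stage 3 stand-in: count vocabulary words instead of a real embedding."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [float(tokens.count(word)) for word in VOCAB]


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def triage(rec: FileRecord, embed, query_vec, meta_min=0.4, sim_min=0.6) -> str:
    """Return 'skip', 'regex_hit', or 'send_to_llm' for one file."""
    if metadata_score(rec) < meta_min:
        return "skip"                      # cheap metadata filter first
    if AWS_KEY_ID.search(rec.text):
        return "regex_hit"                 # structured secrets need no LLM
    if cosine(embed(rec.text), query_vec) >= sim_min:
        return "send_to_llm"               # relevant but ambiguous content
    return "skip"
```

The point of the ordering is cost: each stage is more expensive than the one before it, so most files are discarded before any model is involved.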
The LLMs used by NodeZero are hosted by AWS Bedrock. By design, NodeZero isolates its usage of Bedrock across clients and individual pentests. AWS guarantees that client data is not shared with model providers, is not used to improve base models, and is not accessible to other customers.
NodeZero’s file embedding model runs locally within the client environment. This model is part of a data pipeline that pre-filters data, and only the data snippets from files identified as relevant are sent to the LLM for analysis.
NodeZero users have full control over ADP and can configure whether data is processed by an LLM and what type of data is included. These controls are broken down by feature:
Let’s look at how NodeZero leverages ADP during a pentest. We built these scenarios using a modified version of the well-known Game of Active Directory (GOAD) cyber range.
In a previous writeup, we showed how NodeZero got initial access in the standard GOAD environment by finding the password for samwell.tarly in that user’s Description field in Active Directory:
The contextual clue word “Password” in the description makes it feasible to use regular expressions to pull out the password “Heartsbane”. NodeZero uses well-tuned regexes to pull out obvious passwords like this, and these regexes are surprisingly effective in real-world production environments.
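As a rough illustration of the anchor-word approach, the sketch below looks for a keyword such as “Password”, an optional separator, and then a token. The pattern and the sample description string are assumptions for illustration, not NodeZero's actual regexes.

```python
import re

# Hypothetical anchor-word pattern: keyword, optional ':' or '=', then a token
# that stops at whitespace, parentheses, or quotes.
PASSWORD_HINT = re.compile(
    r"""\b(?:password|passwd|pwd)\b\s*[:=]?\s*([^\s()"']+)""",
    re.IGNORECASE,
)


def extract_password(description: str):
    """Return the token following a password keyword, or None if no keyword is present."""
    match = PASSWORD_HINT.search(description)
    return match.group(1) if match else None


# Illustrative description text containing the anchor word.
print(extract_password("Samwell Tarly (Password : Heartsbane)"))
# With no anchor word there is nothing for the pattern to latch onto.
print(extract_password("Heartsbane"))
```

The second call returns `None`: without the keyword, the pattern has no way to tell a password from any other word.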
But regex has its limits. As an example, let’s say we modify the GOAD setup to put only the password by itself in the Description field like this:
To go further, let’s set up another user, viserys.targaryen, in a different domain, with his password in the Notes field, embedded in Spanish text.
These passwords would be nearly impossible to extract using a general-purpose regex, but with Advanced Data Pilfering, NodeZero extracts these passwords and compromises both users, as shown in the attack path below:
NodeZero does the following:
1. Extracts the password for samwell.tarly from his Description field with the assistance of an LLM.
2. Compromises samwell.tarly.
3. Using samwell.tarly’s access, conducts cross-forest enumeration of users and finds the password “GoldCrown” for viserys.targaryen from his Notes field, again with the assistance of an LLM.
4. Compromises viserys.targaryen.
This LLM-powered analysis isn’t just for Description and Notes. NodeZero also searches for passwords in the adminDescription and comment fields. These fields are editable using the Attribute Editor in Active Directory Users and Computers. By default, any domain user can read the values set in these fields.
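For readers who want to audit these fields themselves, here is a small sketch that builds an LDAP filter for user objects with any of the four attributes populated. The Notes field shown in Active Directory Users and Computers is stored in the `info` attribute. The helper names and the dictionary shape of `entries` are assumptions for illustration; the sketch only builds the filter string and leaves the actual LDAP query to whatever client you use.

```python
# LDAP attribute names behind the fields discussed above.
# "Notes" in Active Directory Users and Computers maps to `info`.
TEXT_ATTRIBUTES = ("description", "info", "adminDescription", "comment")


def build_user_text_filter(attributes=TEXT_ATTRIBUTES) -> str:
    """Build an LDAP filter matching users with at least one attribute set."""
    any_populated = "".join(f"({attr}=*)" for attr in attributes)
    return f"(&(objectCategory=person)(objectClass=user)(|{any_populated}))"


def collect_candidates(entries, attributes=TEXT_ATTRIBUTES):
    """Yield (account, attribute, value) for every populated free-text field.

    `entries` is assumed to be an iterable of dicts, one per user, in the
    shape an LDAP client might return after a search (hypothetical).
    """
    for entry in entries:
        for attr in attributes:
            value = entry.get(attr)
            if value:
                yield entry["sAMAccountName"], attr, value


print(build_user_text_filter())
```

Each yielded value is a candidate for review, whether by regex, by a person, or by a model.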
Finding what looks like a password is one thing; knowing if it’s a valid one is another. This is a problem for human pentesters and tools alike. A discovered string could be an old, reset password, an employee ID, or a password for a different user entirely (password reuse).
NodeZero removes this guesswork. It validates potential credentials by attempting to log in. And if successful, it goes further by abusing the compromised user’s privileges to move laterally, escalate privileges, and access sensitive data. This approach gives defenders a wealth of information and context that can be used to drive effective prioritization and remediation.
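The validation step can be pictured as a loop over candidate pairs with a pluggable login check. In the sketch below, `try_login` is a hypothetical callable supplied by the caller, and the per-user attempt cap is our own assumption (added to illustrate staying clear of account lockout thresholds); neither is described in this post as part of NodeZero.

```python
from typing import Callable, Iterable


def validate_candidates(
    candidates: Iterable,
    try_login: Callable[[str, str], bool],
    max_attempts_per_user: int = 2,
) -> list:
    """Return the (user, secret) pairs that `try_login` confirms.

    Stops trying a user once a secret is confirmed or once the attempt
    cap for that user is reached.
    """
    attempts = {}
    confirmed = {}
    for user, secret in candidates:
        if user in confirmed or attempts.get(user, 0) >= max_attempts_per_user:
            continue
        attempts[user] = attempts.get(user, 0) + 1
        if try_login(user, secret):
            confirmed[user] = secret
    return list(confirmed.items())


# Usage with a fake directory standing in for a real authentication check.
fake_directory = {"samwell.tarly": "Heartsbane"}


def fake_login(user: str, secret: str) -> bool:
    return fake_directory.get(user) == secret


found = validate_candidates(
    [("samwell.tarly", "EMP-00417"), ("samwell.tarly", "Heartsbane")],
    fake_login,
)
print(found)
```

Here the first candidate (an employee-ID-like string) fails and the second is confirmed, which is exactly the distinction a pattern match alone cannot make.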
NodeZero also leverages LLMs to find credentials hidden within files.
In a previous writeup about GOAD, we described how NodeZero compromised a privileged domain user, jeor.mormont, by extracting his credential from a simple PowerShell script file, script.ps1, in the SYSVOL share on a domain controller. That script was simple enough that a general-purpose regex was sufficient to pull out the credential. The context clue words “user” and “password”, appearing on consecutive lines, are strong anchor words for a regex. And these regexes work reasonably well in real production environments.
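A minimal sketch of such a regex follows, anchored on a `user` assignment immediately followed by a `pass...` assignment. The pattern and both sample scripts are hypothetical; they are not the contents of script.ps1 and not NodeZero's actual regexes.

```python
import re

# Hypothetical pattern: a quoted "user..." assignment on one line,
# followed directly by a quoted "pass..." assignment on the next line.
ADJACENT_CRED = re.compile(
    r"""^[ \t]*\$?user\w*[ \t]*=[ \t]*["'](?P<user>[^"'\n]+)["'][ \t]*\r?\n"""
    r"""[ \t]*\$?pass\w*[ \t]*=[ \t]*["'](?P<password>[^"'\n]+)["']""",
    re.IGNORECASE | re.MULTILINE,
)


def find_adjacent_credential(script_text: str):
    """Return (user, password) if the two assignments sit on consecutive lines."""
    match = ADJACENT_CRED.search(script_text)
    return (match.group("user"), match.group("password")) if match else None


# Illustrative script with placeholder values.
SIMPLE_SCRIPT = r'''
$taskName = "nightly job"
$user = "EXAMPLE\svc.backup"
$password = "not-a-real-password"
'''

# Same placeholder credential, but built inline with no anchor words.
INLINE_SCRIPT = r'''
$cred = New-Object System.Management.Automation.PSCredential("EXAMPLE\svc.backup", (ConvertTo-SecureString "not-a-real-password" -AsPlainText -Force))
'''

print(find_adjacent_credential(SIMPLE_SCRIPT))
print(find_adjacent_credential(INLINE_SCRIPT))
```

The first script matches; the second returns `None` even though it holds the same credential, because the anchor words the pattern depends on are gone.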
Let’s make it harder. Suppose we replace that simple file with a more complex script, backup.ps1:
To go further, let’s also place a file called access.txt in a user’s Desktop folder on one of the machines in the range, 192.168.4.22. This file contains the credential for another domain user, jon.snow:
With Advanced Data Pilfering, NodeZero can extract both credentials and chain them into a full attack, as shown below:
NodeZero does the following:
1. Compromises samwell.tarly.
2. Searches SYSVOL using samwell.tarly’s credential and identifies the file backup.ps1 as likely to contain a credential.
3. Extracts the credential for jeor.mormont from backup.ps1.
4. Compromises jeor.mormont, and discovers this user is a local admin on host 192.168.4.22.
5. Accesses host 192.168.4.22 as jeor.mormont.
6. Identifies C:\Users\hodor\Desktop\access.txt as likely to contain a credential.
7. Extracts the credential for jon.snow.
8. Compromises jon.snow.
NodeZero’s support for LLM-assisted credential discovery isn’t limited to SMB shares and compromised hosts. It extends to other common data repositories, including AWS S3 buckets, NFS shares, and Slack.
Every pentester knows that a good report is more than a list of compromised hosts. What truly matters is conveying the business impact, which often comes down to the type of data that was accessed. Was it sensitive customer data, financial records, or intellectual property? For defenders, this same context drives a better understanding of business risk and which security weaknesses to prioritize for remediation.
With Advanced Data Pilfering, NodeZero now automatically classifies the type of data it compromises and links it to tangible business risks.
For instance, in the GOAD environment, we set up an Engineering file share on the host 192.168.4.23 containing R&D data that includes engineering schematics, source code, and legal patent documentation.
📁 \\192.168.4.23\Engineering
|
|-- 📁 Legal_IP
|   |-- 📁 Patents
|   |   |-- 📁 Applications_Pending
|   |   |-- 📁 Issued
|   |   |-- 📁 Prior_Art_Research
|   |-- 📁 Trademarks
|   |-- 📁 Licensing
|   |   |-- 📁 Inbound_Licenses (IP we use)
|   |   `-- 📁 Outbound_Licenses (IP we sell)
|   `-- 📁 Trade_Secrets
|-- 📁 Product_Development
|   |-- 📁 Alchemy_Platform (Software)
|   |   |-- 📁 Architecture
|   |   |-- 📁 Source_Code_Snapshots
|   |   `-- 📁 Security_Audits
|   `-- 📁 Gen4_Sensor (Hardware)
|       |-- 📁 BOM (Bill of Materials)
|       |-- 📁 CAD_Schematics
|       |-- 📁 Firmware
|       `-- 📁 QA_Test_Results
|-- 📁 Research_Lab
|   |-- 📁 Project_Helios (New Battery Tech)
|   |   |-- 📁 Data
|   |   |-- 📁 Lab_Notebooks_Scanned
|   |   |-- 📁 Formulations
|   `-- 📁 Project_Quantum_Leap (AI/ML)
|       |-- 📁 Models
|       |-- 📁 Training_Data
`-- 📁 Strategy
NodeZero gains access to this share after compromising the user viserys.targaryen, as shown in the attack path below:
Once it has access, NodeZero applies its smart data pipeline. It gathers file metadata and samples key files, sending this data to an LLM for deep analysis.
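One way to picture the metadata-and-sampling step is a compact summary of a share listing: extension counts plus a few sample paths per top-level folder. The sketch below is our own simplification under stated assumptions. The file names are hypothetical, and it simply takes the first few files it sees in each folder, whereas NodeZero's actual selection of key files is not described in this post.

```python
from collections import Counter
from pathlib import PureWindowsPath


def summarize_share(relative_paths, samples_per_folder=2) -> dict:
    """Summarize a share listing without reading any file contents."""
    extension_counts = Counter()
    samples = {}
    for raw in relative_paths:
        path = PureWindowsPath(raw)
        extension_counts[path.suffix.lower() or "(none)"] += 1
        # Group by top-level folder; files directly in the share root get "(root)".
        top = path.parts[0] if len(path.parts) > 1 else "(root)"
        bucket = samples.setdefault(top, [])
        if len(bucket) < samples_per_folder:
            bucket.append(raw)
    return {"extensions": dict(extension_counts), "samples": samples}


# Hypothetical file names placed under folders from the Engineering share.
listing = [
    r"Legal_IP\Patents\Issued\example_patent.pdf",
    r"Legal_IP\Trade_Secrets\example_notes.docx",
    r"Legal_IP\Trademarks\example_mark.pdf",
    r"Strategy\example_plan.pptx",
]
summary = summarize_share(listing)
print(summary["extensions"])
print(summary["samples"])
```

A summary like this is small enough to hand to a model, while the folder names and file types already carry much of the signal about what the share holds.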
In this example, NodeZero determined that the share contained Intellectual Property, Manufacturing/Production data, Source Code, and Strategic Business Communications. It then automatically mapped these categories to the following business risks:
The LLM-assisted rationale justifies why the data was categorized this way, giving defenders the specific, actionable evidence they need.
With Advanced Data Pilfering, NodeZero also assesses the risk of compromised databases.
In our GOAD environment, we added a synthetic database for a medical application to one of the Microsoft SQL Servers (192.168.4.22). NodeZero compromised the service account for this server, giving it full control of the database, as shown in the attack path below:
NodeZero then extracted metadata about the database (tables, columns, record counts, and so on) and used an LLM to classify the type of data it contained.
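The idea of classifying from metadata alone can be sketched as follows. To keep the example self-contained, it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for Microsoft SQL Server. The table names, the prompt wording, and the helper names are assumptions for illustration, and the LLM call itself is left out.

```python
import json
import sqlite3


def summarize_schema(conn: sqlite3.Connection) -> dict:
    """Collect table names, column names, and record counts; no row contents."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    summary = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        summary[table] = {"columns": columns, "record_count": count}
    return summary


def build_prompt(summary: dict) -> str:
    """Wrap the metadata in a (hypothetical) classification instruction."""
    return (
        "Classify the kind of data this database holds, "
        "using only the metadata below.\n" + json.dumps(summary, indent=2)
    )


def demo_connection() -> sqlite3.Connection:
    """Build a tiny synthetic medical schema for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (id INTEGER, name TEXT, date_of_birth TEXT)")
    conn.execute("CREATE TABLE lab_results (patient_id INTEGER, test TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?, ?)",
        [(1, "Synthetic A", "1970-01-01"), (2, "Synthetic B", "1980-02-02")],
    )
    conn.execute("INSERT INTO lab_results VALUES (1, 'glucose', 5.4)")
    return conn


print(build_prompt(summarize_schema(demo_connection())))
```

Column names such as `date_of_birth` and table names such as `lab_results` are often enough to suggest the data category without reading a single record.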
NodeZero correctly infers that the database contains Health Data and Personal Data, as evidenced by the presence of “extensive health data including diagnoses, encounters, lab results, and medications” and “personal data including patient demographics and provider information.” It then links compromise of this database to Regulatory Breach Penalties as a business risk.
The examples in this post illustrate how Advanced Data Pilfering (ADP) enhances NodeZero’s pentesting capabilities, both by discovering credentials and by bridging the gap between technical exploits and real-world business risk.
These examples are representative of the types of results NodeZero is delivering in real-world tests. For instance, in a recent real-world attack path, NodeZero used ADP to compromise the domain:
NodeZero did the following:
NodeZero then leveraged this domain admin access to compromise the client’s Microsoft Entra tenant.
This isn’t just a theoretical threat. Real-world threat actors are doing the exact same thing. In a report from August 2025, Anthropic described a “vibe hacker” using Claude Code to facilitate ransomware attacks against at least 17 different organizations. The attacker used Claude Code to actively find credentials and, most notably, classify and analyze stolen data to weaponize it for extortion, mirroring the two core functions of ADP.
At Horizon3, we believe the future of cyber warfare will be played out at machine speed, algorithm vs algorithm, with humans by exception. If you’re a hacker interested in AI and cybersecurity and creating autonomous production-safe solutions that work at scale with no humans in the loop, we want to hear from you.