TL;DR: LLMs demonstrate significant potential to streamline vulnerability research by automating patch diff analysis. Our structured experiments across high-impact CVEs show that targeted LLM applications can reduce analysis time while maintaining accuracy, with Claude Sonnet 3.7 emerging as the optimal balance of performance and cost efficiency.
Vulnerability researchers spend countless hours performing patch diffing, the meticulous process of comparing software versions to identify security fixes and understand attack vectors. This becomes particularly challenging when security advisories provide incomplete information, forcing analysts to manually decompile binaries and sift through thousands of code changes to locate the actual vulnerability.
Traditional patch diffing workflows involve decompiling affected and patched binaries, generating differential reports, and manually analyzing each function change. For complex applications, this process can consume days or weeks of expert time, creating bottlenecks in vulnerability research pipelines.
This blog details our recent research into how LLMs can assist in scaling patch diffing workflows, saving valuable time in a crucial race against attackers. For full details on methodology and results, download our LLM-Assisted Vulnerability Research Guide.
Our research employed a structured methodology targeting four high-impact vulnerabilities, each with a CVSS score of 9.4 or higher, spanning the following vulnerability classes: information disclosure, format string injection, authorization bypass, and stack buffer overflow.
For each test case, we used traditional automation to decompile the affected and patched binaries (using Binary Ninja), generate a differential report (using BinDiff), and extract the decompiled (medium-level instruction language) code for changed functions only.
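A minimal sketch of that extraction step appears below. It assumes a BinDiff result database (a SQLite file whose schema may vary by BinDiff version) and the Binary Ninja Python API; the similarity threshold and helper function names are illustrative, not part of our published tooling.

```python
# Sketch: pull MLIL for changed functions only, using a BinDiff SQLite
# result and the Binary Ninja Python API. Column names (address2,
# similarity) reflect the common BinDiff schema but may vary by version.
import sqlite3
import binaryninja

SIMILARITY_THRESHOLD = 1.0  # matches below this are treated as "changed"

def changed_function_addresses(bindiff_db: str) -> list[int]:
    """Return addresses (in the patched binary) of matched functions
    that BinDiff scored below 1.0 similarity, i.e., changed functions."""
    con = sqlite3.connect(bindiff_db)
    rows = con.execute(
        "SELECT address2 FROM function WHERE similarity < ?",
        (SIMILARITY_THRESHOLD,),
    ).fetchall()
    con.close()
    return [r[0] for r in rows]

def dump_mlil(binary_path: str, addresses: list[int]) -> dict[str, str]:
    """Decompile only the changed functions to MLIL text."""
    bv = binaryninja.load(binary_path)
    bv.update_analysis_and_wait()
    out = {}
    for addr in addresses:
        func = bv.get_function_at(addr)
        if func is None or func.mlil is None:
            continue
        out[func.name] = "\n".join(str(ins) for ins in func.mlil.instructions)
    return out
```

Restricting the dump to changed functions keeps the prompt size (and therefore cost) proportional to the size of the patch rather than the size of the binary.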
We then gave the LLMs two prompts, asking them first to analyze the changed functions and then to rank those most likely to contain the patched vulnerability.
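As a rough illustration of that two-prompt flow (the prompt wording below is a placeholder, not the exact prompts used in our experiments), the interaction with the Anthropic API might look like this:

```python
# Illustrative two-prompt flow using the Anthropic Python SDK; prompt
# wording and file names are placeholders, not the study's exact setup.
# Requires `pip install anthropic` and ANTHROPIC_API_KEY in the env.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-latest"  # assumed model alias

def ask(prompt: str) -> str:
    """Send a single user prompt and return the model's text reply."""
    msg = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

# Output of the extraction step above (hypothetical file name).
mlil_dump = open("changed_functions_mlil.txt").read()

# Prompt 1: triage each changed function in the decompiled output.
analysis = ask(
    "You are reviewing decompiled MLIL from a security patch diff. "
    "For each function below, summarize what changed and note whether "
    "the change looks like a security fix.\n\n" + mlil_dump
)

# Prompt 2: rank candidates so an analyst can start at the top.
ranking = ask(
    "Based on the analysis below, rank the functions most likely to "
    "contain the patched vulnerability, most likely first.\n\n" + analysis
)
print(ranking)
```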
We then reviewed the results from each test to determine the placement of the known vulnerable function. In cases where multiple functions were associated with the vulnerability, we used whichever one had the best placement.
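Conceptually, that scoring step reduces to something like the following sketch; `best_placement` and its inputs are hypothetical names used here for illustration:

```python
# Minimal sketch of the scoring step: given the model's ranked list and
# the set of known vulnerable functions, record the best (lowest) rank.
def best_placement(ranked: list[str], known_vulnerable: set[str]) -> int | None:
    """Return the 1-based rank of the best-placed known vulnerable
    function, or None if none appeared in the ranking."""
    for rank, name in enumerate(ranked, start=1):
        if name in known_vulnerable:
            return rank
    return None

# Example: a placement of 3 counts toward both Top 5 and Top 25.
assert best_placement(["f_a", "f_b", "f_vuln"], {"f_vuln", "f_other"}) == 3
```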
To ensure statistical reliability, we used three LLMs (Claude Haiku 3.5, Claude Sonnet 3.7, and Claude Sonnet 4) and ran multiple iterations of each test case.
The majority of test cases (66%) placed a known vulnerable function within the Top 25 ranked results, suggesting that the overall approach has great promise.
| Test Case | Claude Haiku 3.5 | Claude Sonnet 3.7 | Claude Sonnet 4 |
| --- | --- | --- | --- |
| **INFO DISCLOSURE**<br>Changed functions: 27<br>Known vulnerable: 2 | Top 5: 10/10 (100%)<br>Top 25: 10/10 (100%)<br>Avg. time: 23 mins<br>Avg. cost: $0.72 | Top 5: 10/10 (100%)<br>Top 25: 10/10 (100%)<br>Avg. time: 8 mins<br>Avg. cost: $2.71 | Top 5: 10/10 (100%)<br>Top 25: 10/10 (100%)<br>Avg. time: 39 mins<br>Avg. cost: $2.70 |
| **FORMAT STRING**<br>Changed functions: 134<br>Known vulnerable: 1 | Top 5: 0/10 (0%)<br>Top 25: 0/10 (0%)<br>Avg. time: 5 mins<br>Avg. cost: $0.43 | Top 5: 10/10 (100%)<br>Top 25: 10/10 (100%)<br>Avg. time: 6 mins<br>Avg. cost: $1.71 | Top 5: 10/10 (100%)<br>Top 25: 10/10 (100%)<br>Avg. time: 7 mins<br>Avg. cost: $1.66 |
| **AUTH BYPASS**<br>Changed functions: 1424<br>Known vulnerable: 2 | Top 5: 7/9 (78%)<br>Top 25: 8/9 (89%)<br>Avg. time: 39 mins<br>Avg. cost: $9.26 | Top 5: 10/10 (100%)<br>Top 25: 10/10 (100%)<br>Avg. time: 42 mins<br>Avg. cost: $35.52 | Top 5: 0/10 (0%)<br>Top 25: 0/10 (0%)<br>Avg. time: 412 mins<br>Avg. cost: $34.71 |
| **STACK OVERFLOW**<br>Changed functions: 708<br>Known vulnerable: 1 | Top 5: 0/9 (0%)<br>Top 25: 7/9 (78%)<br>Avg. time: 22 mins<br>Avg. cost: $5.77 | Top 5: 0/10 (0%)<br>Top 25: 0/10 (0%)<br>Avg. time: 28 mins<br>Avg. cost: $22.30 | Top 5: 0/10 (0%)<br>Top 25: 0/10 (0%)<br>Avg. time: 88 mins<br>Avg. cost: $21.79 |
To better understand where this approach excels and where it falters, let’s take a closer look at each test case individually.
Overall, benchmarking these four real-world vulnerabilities with our LLM-powered workflow showed that most patch differential analyses stand to benefit from a first pass by artificial intelligence. This type of analysis is typically performed under extremely time-sensitive conditions, so if an LLM can guide analysts toward vulnerable code faster than usual, without incurring excessive costs, then it’s worth doing.
Perhaps unsurprisingly, our research indicated that more powerful models tend to perform better but can increase analysis time. Balancing cost and speed remains an important consideration. We also found that LLMs can fail to produce helpful results in some cases, especially when minimal information about a vulnerability is available. In the end, though, our methodology proved to be largely successful, and further refinement of the workflow is warranted.
The primary takeaway from this experiment was that LLMs have great potential for improving vulnerability research, especially when used in a targeted (rather than holistic) approach. We at Bishop Fox have already begun incorporating these LLM-assisted techniques into our standard research methodology and continue experimenting with both targeted and holistic approaches.