How context engineering and AI agents unlock real exploitation — not just recon — in penetration testing, red teaming, and bug bounty hunting.
The “I Used AI for Pentesting” Fallacy
The security community has been buzzing with AI-assisted pentesting content for the past year. Blog posts, conference talks, YouTube videos — all of them showing AI doing impressive things. But look closely at what they’re actually doing, and a pattern emerges.
“I used AI to enumerate subdomains.” “I fed my nmap output to ChatGPT.” “I used an LLM to help me write a recon script.” “I found this asset with AI.”
It’s all recon. All surface discovery. Nobody is talking about using AI in the actual exploitation phase — the moment you’ve found a target endpoint and need to figure out if it’s actually vulnerable and how to prove it.
The real question — whether you’re running a red team engagement, chasing bug bounties, or doing a client pentest — is whether AI can help not just find the attack surface, but actually exploit it. The answer is yes, but only if you stop treating LLMs like a smarter grep and start thinking about context engineering.
The Root Problem: What You Give the LLM Determines What You Get
The most common approach looks something like this. You’re working through a web application, you capture an interesting request in Burp Suite, and you paste it into ChatGPT, Claude, or whichever LLM you prefer:
GET /api/orders/573 HTTP/1.1
Host: target.com
Authorization: Bearer eyJhbGciOiJIUzI1NiJ9...

“Find vulnerabilities in this request.”
The LLM dutifully responds with something like: “This endpoint might be vulnerable to IDOR. Try accessing other order IDs like 572 or 574 to see if you can access another user’s data.”
Technically correct. Completely useless.
Here’s what the LLM doesn’t know when it sees that request: Is 573 an order ID or a user ID? Does it map to a sensitive object? Does another user in this application own order 574? Is there an admin role that can access all orders regardless of ownership? What other endpoints exist that might expose the same object? What fields does the Order object contain?
Without this context, the LLM is pattern-matching against its training data, not reasoning about your specific application. It knows that numeric IDs in URL paths are sometimes associated with IDOR vulnerabilities. That’s it. The output is generic because the input is generic.
This is not an LLM limitation. LLMs are genuinely capable of sophisticated security reasoning. It is a context engineering problem — and it’s entirely solvable.
Context Engineering: Understanding the Application Before Attacking It
If you want an LLM to reason about exploitation rather than pattern-match against vulnerability classes, you need to give it semantic context about the application. Not raw HTTP requests — semantic understanding.
This is different from prompt engineering in the narrow sense — it is not about rephrasing your question. It is about what model of the world you hand the LLM before you ask it anything.
What does that mean in practice? Think about what a skilled pentester builds in their head as they browse an application:
Objects are the data entities the application manages. User, Order, Product, Invoice, Report. Every request is doing something to one of these objects.
Roles are the privilege levels defined in the application. Admin, editor, moderator, guest, API key user. They define who is allowed to do what.
Functions are what endpoints actually do, semantically. Not “PUT /api/posts/45” — but “Update Post.” Not “GET /api/admin/users” — but “List All Users (admin function).”
ID relationships are the mappings between identifiers and their owners. Order 573 belongs to User A. Order 498 belongs to User B. This is the raw material for access control testing.
…
A skilled pentester builds this model in their head over the course of an engagement. The insight is that you can build this model programmatically and feed it to an LLM — and when you do, the quality of exploitation reasoning improves dramatically.
The key property of this context is that it is not pre-populated. It accumulates progressively as you browse. The first request you capture tells you almost nothing. By the tenth request, you know the object model. By the thirtieth, you know the role hierarchy, the sensitive fields, and the ID ranges. Every request adds to the knowledge base, and later analysis benefits from everything that came before.
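As a sketch of what “accumulates progressively” can mean in code, here is a minimal context store in Python. Every class, method, and field name below is an illustrative assumption, not taken from any particular tool:

```python
from dataclasses import dataclass, field

@dataclass
class ContextStore:
    """Accumulates semantic knowledge about the target across a session."""
    objects: dict = field(default_factory=dict)    # object name -> set of known fields
    roles: set = field(default_factory=set)        # observed privilege levels
    functions: dict = field(default_factory=dict)  # endpoint -> semantic description
    ownership: dict = field(default_factory=dict)  # credential -> {object: set of IDs}

    def record_object(self, name, fields_seen):
        # Later requests may reveal new fields on an already-known object
        self.objects.setdefault(name, set()).update(fields_seen)

    def record_ownership(self, credential, obj, obj_id):
        self.ownership.setdefault(credential, {}).setdefault(obj, set()).add(obj_id)

# Each analyzed request adds a little more knowledge to the same store
store = ContextStore()
store.record_object("Order", {"id", "total", "user_id"})
store.record_ownership("user_a", "Order", 573)
store.record_ownership("user_a", "Order", 574)
store.record_ownership("user_b", "Order", 498)
```

The point of the structure is not the code itself but the property it enforces: nothing is pre-populated, and every later lookup benefits from every earlier request.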
Browser Traffic
|
v
+-------------+
| Analyzer | <- Extracts: objects, roles, functions, IDs
+------+------+
|
v
+-----------------+
| Context Store | <- Accumulates knowledge across all requests
+------+----------+
|
+--------------------------------------+
v v
+-------------+ +------------------+
| Tester IDOR | | Tester AuthZ |
+-------------+ +------------------+
| |
v v
"User A owns "Can User B access
order IDs: admin endpoints?"
  573, 574, 601"

The Analyzer runs first on every captured request. Its job is not to find vulnerabilities — it is to extract semantic meaning. What object is this request operating on? What role does the authenticated user appear to have? What function is being performed? Are there any IDs in the request that suggest ownership relationships?
The Context Store accumulates everything the Analyzer has extracted, building a growing model of the application across the entire session.
Only after the Context Store has enough information do the specialized testers run — and when they do, they run with that accumulated context baked into their prompts.
This distinction matters. An IDOR tester that doesn’t know which IDs belong to which user can only give you generic advice. An IDOR tester that knows “User A owns order IDs 573, 574, and 601 while User B owns IDs 498 and 512” can give you a specific test: authenticate as User B and request /api/orders/573. That's the difference between reconnaissance-grade output and exploitation-grade output.
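That ownership knowledge is exactly what turns generic advice into concrete tests. A hedged sketch of how cross-user IDOR test cases could be generated from an ownership map (the function name, data shapes, and endpoint template are hypothetical):

```python
def generate_idor_tests(ownership, endpoint_template):
    """Yield (attacker_credential, url) pairs: each credential tries IDs it does NOT own."""
    tests = []
    for victim, objects in ownership.items():
        for obj, ids in objects.items():
            for attacker in ownership:
                if attacker == victim:
                    continue
                # Skip IDs the attacker legitimately owns; those prove nothing
                owned_by_attacker = ownership[attacker].get(obj, set())
                for obj_id in sorted(ids - owned_by_attacker):
                    tests.append((attacker, endpoint_template.format(id=obj_id)))
    return tests

ownership = {
    "user_a": {"Order": {573, 574, 601}},
    "user_b": {"Order": {498, 512}},
}
tests = generate_idor_tests(ownership, "/api/orders/{id}")
```

Each resulting pair is a specific, executable check: authenticate as the attacker credential and request the victim’s URL, then compare the response against the victim’s own view of the same object.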
Agent-Based Exploitation Flow
Understanding the architecture conceptually is one thing. Implementing it as a working agentic AI system means thinking carefully about how to structure each AI agent and what context it receives.
The temptation is to build one large prompt: “Here is a captured HTTP request. Here is the session context. Find all vulnerabilities.” This fails for the same reason that asking one consultant to simultaneously specialize in network security, application security, cryptography, and social engineering fails. Depth requires focus.
When you ask a single LLM instance to simultaneously reason about IDOR, SQL injection, authorization bypass, business logic flaws, and SSRF, the context window fills with considerations from every vulnerability class. The model spreads attention across all of them and goes deep on none. The outputs are generic.
The orchestrator pattern solves this. An orchestrator agent receives each captured request along with the current session context. It doesn’t try to find vulnerabilities itself — it decides which specialized agents to launch based on the characteristics of the request.
Request has numeric ID in path?
-> Launch IDOR agent with: [object type, known IDs per credential, ID range]

Request is to login/SSO endpoint?
-> Launch AuthN agent with: [auth scheme, JWT structure if present]
Request has POST body with JSON fields?
-> Launch Mass Assignment agent with: [known object fields, sensitive field list]
Multiple credentials available?
-> Launch AuthZ agent with: [role hierarchy, endpoint, all credentials]
Each specialized agent receives only the context relevant to its vulnerability class. The IDOR agent doesn’t need to know about the JWT structure. The AuthN agent doesn’t need to know about the object model. Focused context produces focused reasoning.
The key insight is the shape of the question you’re giving each agent. “Find vulnerabilities in this request” is a broad, lazy question. “Can credential B access this admin endpoint that was observed to be accessible by credential A, given the role hierarchy we’ve established?” is a focused question with a yes/no answer and a clear test procedure. Focused questions produce actionable outputs.
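As an illustration, the dispatch rules above could be sketched as a small routing function. The request and session shapes, agent names, and matching heuristics below are assumptions made for the example, not a prescribed interface:

```python
import re

def dispatch(request, session):
    """Decide which specialized agents to launch for a captured request."""
    agents = []
    # Numeric ID in the URL path suggests an object reference worth probing
    if re.search(r"/\d+(?:$|[/?])", request["path"]):
        agents.append(("idor", {"known_ids": session.get("ownership", {})}))
    # Authentication endpoints get the AuthN specialist
    if any(seg in request["path"] for seg in ("/login", "/sso", "/token")):
        agents.append(("authn", {"auth_scheme": session.get("auth_scheme")}))
    # Writable JSON bodies are candidates for mass assignment probing
    if request["method"] in ("POST", "PUT", "PATCH") and request.get("json_body"):
        agents.append(("mass_assignment",
                       {"sensitive_fields": session.get("sensitive_fields", [])}))
    # Cross-credential authorization testing needs at least two credentials
    if len(session.get("credentials", [])) > 1:
        agents.append(("authz", {"roles": session.get("roles", [])}))
    return agents

session = {"credentials": ["editor", "admin"], "roles": ["editor", "admin"]}
launched = dispatch(
    {"method": "PUT", "path": "/api/posts/45", "json_body": {"title": "x"}},
    session,
)
```

Note that each agent tuple carries only the slice of session context that agent needs, which is the whole point of the orchestrator pattern.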
Practical Scenario: CMS Privilege Escalation
Theory aside, let’s walk through what this actually looks like in practice. The target is a CMS application. You have two accounts: an editor (low privilege) and an admin (high privilege). The goal is to discover privilege escalation paths.
Request 1: Editor logs in.
POST /api/auth/login HTTP/1.1
Host: cms.target.com
Content-Type: application/json

{"username": "[email protected]", "password": "..."}

The Analyzer extracts a User object, a JWT token in the response, and a role claim: "role": "editor". The privilege level is inferred as low. The Context Store now knows that an editor role exists, what the JWT structure looks like, and that the auth scheme is Bearer token.
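One way an analyzer might extract that role claim is by base64url-decoding the JWT payload without verifying the signature (inspection only, never for trust decisions). A minimal sketch; the token below is constructed locally purely for illustration:

```python
import base64
import json

def jwt_claims(token):
    """Decode a JWT payload without verifying the signature (inspection only)."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build an illustrative unsigned token with an editor role claim
header = base64.urlsafe_b64encode(b'{"alg":"HS256"}').rstrip(b"=").decode()
payload = base64.urlsafe_b64encode(
    b'{"sub": "[email protected]", "role": "editor"}'
).rstrip(b"=").decode()
token = f"{header}.{payload}.fake-signature"

claims = jwt_claims(token)
```

The extracted claims are what the Context Store records: an editor role exists, and the auth scheme carries the role inside the token itself.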
Request 2: Editor browses to their drafts.
GET /api/posts/drafts HTTP/1.1
Authorization: Bearer <editor-token>

The Analyzer extracts a Post object with fields: id, title, content, status, author_id. The status field is flagged as potentially sensitive — it controls publication state. The Context Store now knows that the editor can read drafts, and has a model of the Post object.
Request 3: Editor updates a post.
PUT /api/posts/45 HTTP/1.1
Authorization: Bearer <editor-token>
Content-Type: application/json

{"title": "New Title", "content": "..."}

This request is the trigger. The Analyzer extracts a Post update function and flags the status field as sensitive given what the Context Store already knows about it.
Finding: Mass Assignment on the status field.
The Mass Assignment agent launches with its own focused context: Post object fields, sensitive field list, the PUT endpoint. Its question: “Can an editor inject privileged fields through the Post update endpoint?” It tests injecting "status": "published" — bypassing the editorial workflow — and confirms the server accepts it.
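A sketch of how such probes might be constructed: for each field the Context Store has flagged as sensitive, clone the legitimate body and inject that one field. The function name and injected values are illustrative, not from any specific tool:

```python
def build_mass_assignment_probes(legitimate_body, sensitive_fields):
    """For each flagged field, return a copy of the legitimate body with it injected."""
    probes = []
    for field_name, injected_value in sensitive_fields.items():
        body = dict(legitimate_body)  # keep the original request untouched
        body[field_name] = injected_value
        probes.append(body)
    return probes

legit = {"title": "New Title", "content": "..."}
sensitive = {"status": "published", "author_id": 1}
probes = build_mass_assignment_probes(legit, sensitive)
# Each probe would then be sent via the observed PUT endpoint; if the server
# persists the injected field, the editorial workflow has been bypassed.
```

One probe per field keeps the signal clean: if a mutated request succeeds, you know exactly which field the server failed to filter.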
Request 4: You authenticate as admin separately and browse to the user management panel.
GET /api/admin/users HTTP/1.1
Authorization: Bearer <admin-token>

The Analyzer extracts the admin role, notes the /api/admin/ namespace pattern in the URL, and identifies a User management function. The Context Store now knows that an admin role exists with higher privilege, and that admin functions follow the /api/admin/ URL pattern.
Finding: Broken Access Control on /api/admin/users. (OWASP A01:2021 — Broken Access Control, the top web application security risk.)
The AuthZ agent launches with full context: editor credential, admin credential, known role hierarchy, known admin endpoint pattern. Its focused question: “Can the editor credential access /api/admin/users?” It constructs the test, executes it, and gets back a 200 OK.
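The core of that test is a simple differential check: replay the endpoint with both credentials and compare outcomes. A hedged sketch with a stubbed transport (no real HTTP here; `vulnerable_send` is a hypothetical stand-in for a server that never checks roles):

```python
def check_vertical_authz(endpoint, send, low_cred, high_cred):
    """Replay a privileged endpoint with a low-privilege credential.
    `send(endpoint, credential)` must return an HTTP status code."""
    baseline = send(endpoint, high_cred)  # confirm the endpoint works for admin
    probe = send(endpoint, low_cred)      # then try it with the editor token
    broken = baseline == 200 and probe == 200
    return {"endpoint": endpoint, "baseline": baseline, "probe": probe, "broken": broken}

# Stub transport simulating a server with no role enforcement at all
def vulnerable_send(endpoint, credential):
    return 200

result = check_vertical_authz(
    "/api/admin/users", vulnerable_send, "editor-token", "admin-token"
)
```

The baseline request matters: a 200 for the low-privilege credential only proves broken access control if the same endpoint is confirmed to be an admin function in the first place.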
Neither of these findings would have surfaced by dumping individual requests into an LLM. The access control finding required knowing that admin endpoints follow the /api/admin/ pattern — knowledge that only existed because Request 4 had been analyzed and stored. The mass assignment finding required cross-referencing two separate requests: the status field was discovered in Request 2, and the write endpoint was discovered in Request 3. Without a Context Store connecting them, there is nothing to cross-reference.
This is the difference between an LLM acting as a stateless pattern-matcher and an LLM acting as a contextually aware exploitation assistant. The underlying model capability is the same. The context engineering is what changes the outcome.
Closing: The Missing Piece
The gap in the security community’s use of AI is not about what LLMs can do. Modern LLMs are genuinely capable of sophisticated security reasoning — understanding authorization models, identifying trust boundary violations, recognizing business logic flaws. That capability exists and is underused.
The gap is about how we use LLMs. Treating an LLM as a smarter grep — feeding it raw requests and expecting exploitation guidance — ignores everything that makes LLMs actually powerful: their ability to reason about relationships, hierarchies, and semantic meaning across a complex system.
Context engineering is the missing piece. Build a model of the application as you browse. Extract objects, roles, functions, and ID ownership. Accumulate this context progressively. Then give specialized agents focused questions backed by that accumulated knowledge.
Cyberstrike is an open-source tool that implements exactly this architecture for web proxy testing. It uses a browser extension to capture traffic, a proxy-analyzer agent to build session context progressively, and specialized sub-agents for each vulnerability class. If you want to see these concepts working in practice rather than building from scratch, it’s a solid reference implementation.
Documentation and source: https://github.com/CyberStrikeus/CyberStrike
The community will keep producing “I used AI for recon” content, and some of it will be genuinely useful. But if you’re a penetration tester, red teamer, or bug bounty hunter looking for AI-powered security testing that actually reaches the exploitation phase, the answer is not a better prompt. It is a better architecture.
Stop treating LLMs like a smarter grep. Start treating them like a reasoning engine that needs the same context a skilled human tester builds up over the course of an engagement. The results will follow.