At ProjectDiscovery, we've been building Neo, an autonomous security testing platform that runs multi-agent, multi-step workflows, routinely executing 20-40+ LLM steps per task. It performs vulnerability assessments, code reviews, and security audits at scale, enabling continuous testing across the entire development lifecycle.
When we launched, our LLM costs were staggering. A single complex task with Opus 4.5 could consume 60 million tokens. Then we implemented prompt caching. Here's what changed:
| Metric | Before | After |
|---|---|---|
| Cache hit rate | 7% | 84% |
| Overall cost savings | baseline | -59% |
| Post-optimization savings (since Feb 16) | baseline | -66% |
| Last 10 days | baseline | -70% |
| Tokens served from cache | - | 9.8 billion |
Traditional chatbot interactions are 1-2 turns. You send a message, you get a response. The system prompt is maybe 500 tokens. Agentic systems are fundamentally different.
Take a realistic security task: scan a target's attack surface, identify exposed services, fingerprint technologies, cross-reference known CVEs, attempt exploitation paths, and produce a structured report. That's not one prompt. That's a coordinated sequence of decisions, tool calls, and intermediate findings, each building on the last.
Neo's average task runs 26 steps with 40 tool calls. System prompts are 2,500+ lines of YAML, over 20K tokens per agent. Each step re-sends the entire conversation: system prompt, tool definitions, and all prior messages. Multi-agent architectures multiply this further; our agent swarm prompt alone is 2,547 lines. Without caching, every step pays full price for the entire prefix. On a 40-step task, you're sending that 20K-token system prompt 40 times. And as the conversation grows linearly, step N re-sends everything from steps 1 through N-1.
The agentic tax: the cost of intelligence compounds quadratically with task complexity. Caching is the only structural fix.
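A back-of-the-envelope sketch makes the tax concrete. Assuming a fixed prefix (system prompt plus tool definitions) and a constant number of new tokens per step - illustrative numbers, not Neo's real figures - total input tokens grow quadratically with step count:

```python
def total_input_tokens(steps: int, prefix: int, per_step: int) -> int:
    """Each step re-sends the prefix plus all prior steps' messages."""
    total = 0
    for n in range(1, steps + 1):
        total += prefix + (n - 1) * per_step
    return total

# 40 steps, a 20K-token prefix, 2K new tokens per step:
uncached = total_input_tokens(40, 20_000, 2_000)  # 2,360,000 tokens
# The prefix alone is paid 40 times, and the linearly growing history
# sums to a quadratic total - exactly the curve caching flattens.
```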
Overall, caching saved 59% on LLM costs compared to what the same token volume would have cost at full input rates. Post-optimization that number is 66%, and the last 10 days are at 70%. These figures come from actual reported costs, not estimated pricing - we derive the effective per-token rate from real spend, then compare it against what that same volume would have cost if every input token were charged at the standard rate instead of the 10% rate for cached reads.
Anthropic's prompt caching works by marking stable prefixes with cache_control markers. When a prefix matches a previous request, those tokens are served from cache instead of reprocessed. You get up to 4 breakpoints per request. We use three, and we had to invent a relocation trick to make them work properly.
The first breakpoint marks the last static system message - the core agent instructions that don't change between users, threads, or requests. The key decision here was using a 1-hour TTL instead of the default 5 minutes. This keeps the system prompt cache alive across users and tasks. When multiple concurrent tasks are running the same agent type, the prefix stays warm continuously during business hours.
We cover BP3 before BP2 because BP3 sits earlier in the prompt prefix. BP3 marks the last static tool definition. Tools are sorted so static tools come first; dynamic per-user subagents come last. This creates a shared cache chain - [system prompt -> BP1] [static tools -> BP3] - that is identical across all users and cached for an hour.
BP2 marks the last tool result in the conversation, creating a sliding window where each new step only pays for messages added since the last breakpoint. It uses a 5-minute TTL because conversation state is per-session and changes rapidly.
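Here's a minimal sketch of the three-breakpoint layout as an Anthropic Messages API request body. The model name, tool schemas, and message contents are placeholders; the `cache_control` placement is the part that matters. (The 1-hour TTL may require an extended-TTL beta header depending on your API version.)

```python
request = {
    "model": "claude-opus-4-5",  # illustrative model id
    "system": [
        {
            "type": "text",
            "text": "<2,500 lines of static agent instructions>",
            # BP1: last static system block, kept warm across users
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
    ],
    "tools": [
        {"name": "scan", "description": "...", "input_schema": {"type": "object"}},
        {
            "name": "report",
            "description": "...",
            "input_schema": {"type": "object"},
            # BP3: last *static* tool; dynamic per-user subagents come after
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
    ],
    "messages": [
        {"role": "user", "content": "Scan example.com"},
        {"role": "assistant", "content": [
            {"type": "tool_use", "id": "t1", "name": "scan", "input": {}},
        ]},
        {"role": "user", "content": [
            {
                "type": "tool_result",
                "tool_use_id": "t1",
                "content": "22/tcp open",
                # BP2: last tool result - a 5-minute sliding window
                "cache_control": {"type": "ephemeral"},
            },
        ]},
    ],
}
```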

Only the blocks to the right of the last BP2 marker are reprocessed each step. Everything else is served from cache.
Anthropic has a documented caveat: if your prompt has more than 20 content blocks before a cache breakpoint and you modify content earlier than those 20 blocks, you won't get a cache hit. We handle this with intermediate breakpoints every 18 blocks, which supports up to 54 content blocks - roughly 18 to 27 agent steps - before degrading to partial caching.
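The placement logic can be sketched as a small helper. `BLOCK_INTERVAL` and the function name are our own illustrative names, not SDK API; the idea is simply that a marker lands every 18 blocks and only the most recent markers are kept:

```python
BLOCK_INTERVAL = 18  # stay safely under the 20-block lookback limit

def breakpoint_indices(num_blocks: int, max_breakpoints: int = 3) -> list[int]:
    """Indices of content blocks that should carry a cache_control marker.

    A marker every 18 blocks guarantees no edit is ever more than 20
    blocks past the nearest breakpoint; keeping only the last few
    markers respects the per-request breakpoint budget.
    """
    idxs = [i for i in range(BLOCK_INTERVAL - 1, num_blocks, BLOCK_INTERVAL)]
    return idxs[-max_breakpoints:]
```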
This was our single most impactful optimization. Anthropic's cache is strictly prefix-based - anything that changes in the middle of the prefix invalidates everything after it. Our original structure had working memory, skills context, and runtime context sitting between BP1 and BP3. Working memory changes on nearly every step. This was silently killing our cache hits.

The fix: dynamic content is appended as a user message at the tail, so changes only affect the final block. Cache hit rate: ~74%.
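A minimal sketch of the relocation, with illustrative names: dynamic context is pulled out of the system prompt and appended as the final user message, leaving everything before it as a stable, cacheable prefix.

```python
def build_messages(history: list[dict], working_memory: str,
                   runtime_context: str) -> list[dict]:
    """Append all per-step dynamic context as the last user message."""
    dynamic = (
        "<runtime_context>\n" + runtime_context + "\n</runtime_context>\n"
        "<working_memory>\n" + working_memory + "\n</working_memory>"
    )
    # Only this tail block changes between steps; the prefix stays intact.
    return history + [{"role": "user", "content": dynamic}]
```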
Our system prompts are YAML templates with variables like {{current_datetime}}, {{env_vars_list}}, and {{task_workspace}}. Rendering these with actual values would give every user a unique system prompt, destroying cross-user cache sharing at BP1. We render templates with stable placeholders instead. The actual values arrive via Runtime Context at the wire level through the relocation trick above.
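The rendering step looks roughly like this (placeholder format and names are illustrative): template variables are replaced with stable tokens rather than per-user values, so every user produces a byte-identical BP1 prefix.

```python
# Map each template variable to a *stable* placeholder instead of its value.
STABLE_PLACEHOLDERS = {
    "current_datetime": "{RUNTIME:current_datetime}",
    "env_vars_list": "{RUNTIME:env_vars_list}",
    "task_workspace": "{RUNTIME:task_workspace}",
}

def render_system_prompt(template: str) -> str:
    """Render {{var}} slots with shared placeholders, not real values."""
    out = template
    for var, placeholder in STABLE_PLACEHOLDERS.items():
        out = out.replace("{{" + var + "}}", placeholder)
    return out
```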
We freeze the datetime once per task run and format it as date-only - no clock time. Including the current time would change Runtime Context every second, causing cache misses on the tail. Date-only keeps it stable for an entire day.
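In sketch form (function name illustrative):

```python
from datetime import datetime, timezone

# Frozen once at task start; date-only, so the string is byte-identical
# for every step of the task and for every task run the same day.
TASK_DATE = datetime.now(timezone.utc).strftime("%Y-%m-%d")

def runtime_context() -> str:
    return f"current_date: {TASK_DATE}"  # no clock time -> stable tail
```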
Anthropic caches are provider-specific. A request hitting Anthropic Direct and a follow-up hitting Amazon Bedrock won't share caches even if the prompts are identical. We route all traffic to Anthropic Direct first and fall back to Bedrock and Vertex only during outages. That keeps the cache pool shared across the whole user base.
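The routing policy is simple in sketch form. The provider names and the injected `send` transport are stand-ins, not a real SDK client; the point is a strict preference order so cache state concentrates on one provider:

```python
PROVIDERS = ["anthropic_direct", "bedrock", "vertex"]  # preference order

def route_request(payload: dict, send) -> tuple[str, dict]:
    """Try providers in order; fall through only on failure (outage)."""
    last_err = None
    for provider in PROVIDERS:
        try:
            return provider, send(provider, payload)
        except Exception as err:
            last_err = err  # provider down; try the next one
    raise RuntimeError("all providers failed") from last_err
```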
When an agent makes parallel tool calls, the SDK fans each tool response into a separate wire message. Marking the tool message with cache_control causes every wire message to inherit it - potentially exhausting all 4 breakpoint slots from a single mark. We mark only the last content part of the last tool message: one breakpoint consumed instead of N.
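A sketch of that marking pass, assuming the Messages API shape where tool results arrive as content parts of a user-role message (helper name is ours):

```python
def mark_sliding_window(messages: list[dict]) -> None:
    """Set cache_control on only the last tool_result part of the
    last tool message, consuming one breakpoint instead of one per result."""
    for msg in reversed(messages):
        parts = msg.get("content")
        if msg.get("role") == "user" and isinstance(parts, list) and parts:
            if parts[-1].get("type") == "tool_result":
                parts[-1]["cache_control"] = {"type": "ephemeral"}
                return  # mark exactly one part, then stop
```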
The week of Feb 16 is when we shipped the relocation trick and three-breakpoint architecture. Cache rate jumped from under 8% to 74% overnight. Everything after that was incremental.
| Week starting | Streams | Cache rate | Notes |
|---|---|---|---|
| Feb 2 | 955 | 4.2% | |
| Feb 9 | 1,186 | 7.6% | |
| Feb 16 | 1,384 | 73.7% | relocation shipped |
| Feb 23 | 1,497 | 78.2% | |
| Mar 2 | 1,347 | 77.4% | |
| Mar 9 | 1,309 | 80.2% | |
| Mar 16 | 1,126 | 84.3% | |
| Mar 23 | 975 | 85.0% | |
| Mar 30 | 666 | 82.9% | |
The more steps a task takes, the higher the cache rate. This matters because caching disproportionately helps the most expensive tasks - which is exactly the cost curve you want.
| Steps | Streams | Avg cache rate | Avg input tokens |
|---|---|---|---|
| 1 | 2,801 | 35.5% | 47,518 |
| 2-3 | 794 | 30.0% | 161,442 |
| 4-5 | 620 | 42.8% | 253,880 |
| 6-10 | 1,284 | 53.6% | 379,818 |
| 11-20 | 1,729 | 63.9% | 745,685 |
| 20+ | 3,139 | 74.0% | 3,763,263 |
The single-step dip to 35% makes sense: there's no conversation history to cache yet, so only BP1 contributes. By step 20+, BP2's sliding window is doing the heavy lifting across 3.7 million-token inputs.
| Task | Model | Input tokens | Cache rate | Steps |
|---|---|---|---|---|
| c790f4... | Opus 4.5 | 67.5M | 91.8% | 1,225 |
| a78a99... | Opus 4.5 | 57.5M | 92.9% | 1,663 |
| 8c42b5... | Opus 4.5 | 57.2M | 83.2% | 1,428 |
| 0935514e... | Opus 4.5 | 66.8M | 3.2% | - |
Task c790f4 ran 67.5 million input tokens across 1,225 steps at 91.8% cache rate. Compare that with 0935514e: nearly identical token volume, 3.2% cache rate, and roughly six times the effective input cost per token. The latter ran before the optimization rollout.
Anthropic recently released automatic prompt caching, a great addition that handles breakpoint placement without any cache_control markers in your code. For many use cases, it's a solid starting point that removes a lot of the manual work.
For an agentic system like Neo, though, we needed more control. Automatic caching has no awareness of which parts of your prompt are stable versus dynamic. In Neo's case, working memory, runtime context, and per-user variables sit in the middle of the prompt and change on every step, which leads to consistent cache misses on exactly the content that benefits most from caching.
TTL control is the other piece. For shared static content like system prompts, a 1-hour TTL is what keeps the cache warm across users and tasks. Without it, a cold start every 5 minutes on a busy platform means a significant fraction of requests become cache writes rather than reads, and cache writes cost more than standard input tokens.
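A rough model of the effective per-token input rate shows why. The multipliers below are Anthropic's published ones at the time of writing (cached reads 0.1x base; cache writes 1.25x for the 5-minute TTL, 2x for the 1-hour TTL); the two traffic mixes are made-up illustrations, not measured data:

```python
READ, WRITE_5M, WRITE_1H, BASE = 0.10, 1.25, 2.00, 1.00

def effective_rate(read_frac: float, write_frac: float,
                   write_mult: float) -> float:
    """Blended input cost per token, as a multiple of the base rate."""
    uncached = 1.0 - read_frac - write_frac
    return read_frac * READ + write_frac * write_mult + uncached * BASE

# A 1h cache that stays warm (mostly reads, rare expensive writes)
# still beats a 5m cache that keeps going cold and re-writing:
warm = effective_rate(read_frac=0.84, write_frac=0.02, write_mult=WRITE_1H)
cold = effective_rate(read_frac=0.40, write_frac=0.30, write_mult=WRITE_5M)
```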
Automatic caching is a smart default that Anthropic has made easy to adopt. Neo's scale and complexity required going further with explicit breakpoint placement and deliberate TTLs, which is what took us from single-digit hit rates to 84%.
Every percentage point improvement in cache hit rate reduces what it costs to run a task through Neo. Going from 7% to 84% means that the most complex security audits, the ones that used to be orders of magnitude more expensive per run, are now economically viable to run repeatedly. And because those savings flow directly to the teams using Neo, what was once a one-off assessment can now become a continuous part of how security work gets done, at scale, for organizations that can't afford to slow down.
If you're building an agentic platform and haven't prioritized prompt caching yet, start here. It's the highest-ROI infrastructure change we've shipped.
If you want to see what Neo can do for your security workflows, request a demo.
For cache_control syntax, TTL behavior, breakpoint limits, and supported models, see Anthropic's prompt caching documentation.

Neo is ProjectDiscovery's autonomous security testing platform. This is the first post in our engineering blog series covering the technical decisions behind Neo.