TL;DR: Most LLM security testing today relies on static prompt checks, which miss the deeper risks posed by conversational context and adversarial manipulation. In this blog, we focus on how real pen testing requires scenario-driven approaches that account for how these models interpret human intent and why traditional safeguards often fall short.
If you're exploring how to secure large language models (LLMs), don't miss our webcast: Breaking AI: Inside the Art of LLM Pen Testing. It goes deeper into the techniques, mindsets, and real-world lessons covered in this blog.
As more organizations integrate LLMs into their products and systems, security teams are facing a tough question: how do you test these systems in a meaningful way?
At Bishop Fox, our consultants have been digging deep into this space. We’ve been testing real-world deployments, learning from failures, and adjusting our approach. Recently, we got the team together to talk through what’s working, what isn’t, and how security testing needs to evolve. This post shares some of the insights from that conversation.
A lot of teams start with prompt engineering. They try to test LLMs by hardcoding prompts or adding token filters during development. That’s helpful for catching obvious stuff, but it doesn’t really tell you whether the system is secure.
However, LLMs aren’t like traditional software. They respond based on context, not just code. They’re trained on human conversation, which means they inherit all the unpredictability that comes with it. Slight changes in tone or phrasing can lead to very different outputs.
To test them properly, you need to think like an attacker. You need to understand how people can twist language, shift intent, and use context to get around rules.
For example, one of our consultants ran into a safety policy that said, “Children should not play with fire.” He asked the model to rewrite the rule for adults, with the logic that adults shouldn’t have the same limitations. The model flipped the meaning and suggested, “You should play with fire.” A small change, but a clear failure.
These kinds of attacks work because LLMs try to be helpful. They don’t just follow rules; they interpret intent. That’s where the real risk lies.
We must treat LLMs as conversational systems, not command-line tools. Therefore, testing must focus on realistic scenarios that mimic how a real user or attacker might try to misuse the model.
Filtering outputs or adding rate limits isn’t enough. If you want to build a secure system around an LLM, you need to treat it like any other high-risk component.
Here’s what our team recommends:
It’s the same layered approach we use elsewhere in security, just applied to generative systems.
Here’s a challenge: some attacks won’t happen the same way twice.
That’s not necessarily a failure of testing. It’s a byproduct of how these models work. The same input can produce different results depending on timing, previous messages, or even slight variations in phrasing.
When that happens, we document everything. We provide full transcripts, the surrounding context, and our recommendations for how to avoid similar issues. We focus on why it worked – not just what happened.
This space is changing fast. What works today might not work next week. But here’s what we’re seeing success with right now:
It’s not about perfection. It’s about being flexible, aware, and focused on actual impact.
If you’re responsible for shipping or securing AI systems, here are a few things to keep in mind:
We’re actively testing and learning. If you’re on a similar path, we’d love to compare notes.
Have a story to share? Hit a weird edge case? We’d love to connect you with someone on our team who’s knee-deep in this work.