This article summarizes our investigation into jailbreaking 17 of the most popular generative AI (GenAI) web products that offer text generation or chatbot services.
Large language models (LLMs) typically include guardrails to prevent users from generating content considered unsafe (such as language that is biased or violent). Guardrails also prevent users from persuading the LLM to communicate sensitive data, such as the training data used to create the model or its system prompt. Jailbreaking techniques are used to bypass those guardrails.
The goals of our jailbreak attempts were to assess both types of issues.
Our findings provide a more practical understanding of how jailbreaking techniques could be used to adversely affect end users of LLMs. We did this by directly evaluating the GenAI applications and products that are in use by consumers, rather than focusing on a specific underlying model.
We hypothesized that GenAI web products would implement robust safety measures beyond their base models' internal safety alignments. However, our findings revealed that all tested platforms remained susceptible to LLM jailbreaks.
Key findings of our investigation include:
Given the scope of this research, it was not feasible to exhaustively evaluate every GenAI powered web product. To ensure we do not create any false impressions about specific providers, we have chosen to anonymize the tested products mentioned throughout the article.
It is important to note that this study targets edge cases and does not necessarily reflect typical LLM use cases. We believe most AI models are safe and secure when operated responsibly and with caution.
While it can be challenging to guarantee complete protection against all jailbreaking techniques for a specific LLM, organizations can implement security measures that can help monitor when and how employees are using LLMs. This becomes crucial when employees are using unauthorized third-party LLMs.
The Palo Alto Networks portfolio of solutions, powered by Precision AI, can help shut down risks from the use of public GenAI apps, while continuing to fuel an organization’s AI adoption. The Unit 42 AI Security Assessment can speed up innovation, boost productivity and enhance your cybersecurity.
If you think you might have been compromised or have an urgent matter, contact the Unit 42 Incident Response team.
Related Unit 42 Topics | Prompt Injection, GenAI |
Many web products have incorporated LLMs in their core services. However, they can generate harmful content if not properly controlled. To mitigate this risk, LLMs are trained with safety alignments to prevent the production of harmful content.
However, these safety alignments can be bypassed through a process called LLM jailbreaking. This process involves crafting specific prompts (known as prompt engineering or prompt injection) to manipulate the model's output, and it leads the LLM to generate harmful content.
Generally, LLM jailbreak techniques can be classified into two categories:
Our LIVEcommunity post Prompt Injection 101 provides a list of these strategies.
People’s goals when attempting a jailbreak will vary, but most relate to AI safety violations. Some aim to extract sensitive information from the targeted LLM, such as model training data or system prompts.
Our Prompt Injection 101 post also includes a list of common jailbreak goals.
In this study, we focused on the following jailbreak goals:
Many existing works evaluate the impact of LLM jailbreaks.
These existing research articles provide valuable information on the possibility and effectiveness of in-the-wild LLM jailbreaks. However, they either focus solely on violating safety goals or discuss a specific type of sensitive information leakage. In addition, these evaluations are mostly model-oriented, meaning that the evaluation is performed against a certain model.
In this study, our goals are:
We evaluated 17 apps from the Andreessen Horowitz (aka a16z) Top 50 GenAI Web Products list, focusing on those offering text generation and chatbot features. The data and findings presented in this study were effective as of Nov. 10, 2024.
We evaluated each application using its default model to simulate a typical user experience.
All the target apps provide a web interface for interacting with the LLM. However, with only access to the interface, it is challenging to test the target app at scale. For example, this testing could include using automated LLM jailbreak tools, as done in various previous research studies (e.g., h4rm3l [PDF], DAN in the wild jailbreak prompts [PDF]).
Due to this limitation, we relied on manual effort for testing the target apps. In our evaluation, we assessed each app against the goals defined in Table 1 in the next section. For each goal, we applied both single-turn and multi-turn strategies.
After we obtained responses from the target apps, we manually checked the response to determine if the attack was successful. Finally, we computed the attack success rate (ASR) on each goal and strategy we tested.
The strategies we chose for our experiments are well-known jailbreak techniques that have been extensively explored in previous research and studies. We classified these strategies into single-turn and multi-turn categories based on the number of interaction rounds required to complete a jailbreak task. According to existing literature, multi-turn strategies are generally considered more effective than single-turn approaches for achieving AI safety violations jailbreak goals.
Due to the manual nature of our testing process and the greater variety of single-turn strategies available in the literature compared to multi-turn strategies, we focused on two multi-turn strategies that we observed to be the most effective, while maintaining a broader range of single-turn approaches. However, we note that this selective sampling of multi-turn strategies may introduce bias into our comparative analysis. Since we specifically chose the two most effective multi-turn strategies while testing a wider range of single-turn approaches, our results regarding the relative effectiveness of multi-turn versus single-turn strategies should be interpreted with this limitation in mind.
The attack success rate (ASR) is a standard metric used to measure the effectiveness of a jailbreak technique. It is computed by dividing the number of successful jailbreak attempts (where the model provides the requested restricted output) by the total number of jailbreak attempts made across all prompts. In our context, this ASR computation is performed on each strategy and goal. When we compute ASR for a given goal, we accumulate all the successful jailbreak prompts across all the apps and strategies, and divide it by the total amount of prompts. Similarly, for a given strategy, we accumulate all successful jailbreak attempts across all apps and goals, and divide by the total number of attempts made with that strategy.
We compiled a diverse set of single-turn prompts from existing research literature to test various jailbreak techniques. These prompts fall into six main categories:
Our experiments employ two multi-turn strategies:
The crescendo technique is a simple multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies progressively leading to a successful jailbreak.
The Bad Likert Judge jailbreaking technique manipulates LLMs by having them evaluate the harmfulness of responses using a Likert scale, which is a measurement of agreement or disagreement toward a statement. The LLM is then prompted to generate examples aligned with these ratings, with the highest-rated examples potentially containing the desired harmful content.
Table 1 shows the results of our testing. The column headers show the apps we tested, and the row headers show the jailbreak goals. For each goal, we split the tests between single-turn and multi-turn strategies.
If a jailbreak attempt successfully achieves a given goal on the target application, we mark it as ✔. Conversely, if the attempt fails, we mark it as ✘.
Figure 1 presents the ASR comparison between single-turn and multi-turn strategies across 17 apps. For each app, we tested 8 goals using 8 different strategies (6 single-turn and 2 multi-turn). For each strategy, we created 5 different prompts using that strategy, and then replayed each prompt 5 times. This results in a total of 25 attack attempts per strategy. For single-turn attacks, the ASR was calculated by dividing the number of successful attempts by 2,550 prompts (17 apps × 6 strategies × 25 prompts). For multi-turn attacks, the ASR was calculated by dividing the number of successful attempts by 850 prompts (17 apps × 2 strategies × 25 prompts).
Based on the results, we have the following observations:
Figure 2 presents the ASR of the tested single-turn strategies. The results indicate that storytelling is the most effective strategy (among both single-turn and multi-turn) across all tested GenAI web applications. Its ASRs range from 52.1% to 73.9%. It achieves its highest effectiveness in malware generation scenarios. Role-play follows as the second most effective approach (across strategies of both types), with success rates between 48.5% and 69.9%.
In addition, we found that previously effective jailbreaking techniques like DAN have become less effective, with ASRs ranging from 7.5% to 9.2% across different goals. This significant decrease in effectiveness is likely due to enhanced alignment measures [PDF] in current model deployments to counter these known attack strategies.
One strategy that has a very low ASR is the repeated token strategy. This involves requesting that the model generate a single word or token multiple times in succession. For example, one might have the model output the word “poem” repeatedly 100,000 times. The technique is mainly used to leak model training data (this Dropbox blog on repeated token divergence attacks has further details). In the past, the repeated token strategy has been reported to leak training data from popular LLMs, but our results show that it is no longer effective on most of the tested products, with only a 2.4% success rate in training data leakage attempts and 0% across all other goals.
Figure 3 shows the comparative effectiveness of multi-turn strategies across all the jailbreak goals. Overall, the result indicates that the Bad Likert Judge technique has slightly higher success rates compared to the Crescendo attack. When comparing the ASR across AI safety violation goals, Bad Likert Judge achieves an ASR of 45.9%, while Crescendo shows a slightly lower ASR of 43.2%. The difference is most noticeable in the goal of malware generation, where Bad Likert Judge achieves a 56.7% success rate compared to Crescendo's 52.5%. In addition, it’s worth noting that only Bad Likert Judge had limited success in the system prompt leakage goal, while Crescendo failed to leak any system prompt.
For the data leakage goals (training data and PII), both multi-turn strategies proved completely ineffective, with 0% ASR.
Our experimental results revealed that both single-turn and multi-turn strategies had very limited effectiveness in leaking system prompts, with overall success rates of 1.6% and 0.2% respectively (see the system prompt column in Figure 1). Among all single-turn strategies tested, only the instruction override strategy was able to leak system prompts, achieving a 9.9% success rate (see Figure 2). For multi-turn approaches, while the Bad Likert Judge strategy showed minimal success with a 0.5% ASR, we weren’t able to leak system prompts by using the Crescendo strategy. Figure 4 shows an example of a prompt that uses the instruction override strategy to leak the application's system prompt.
Overall, while the ASR is notably low on the system prompt leakage goal, the results demonstrate that some tested GenAI apps are still vulnerable to system prompt leakage, particularly when attacked by prompts using the instruction override strategy.
A March 2024 Dropbox blog discussed a significant vulnerability regarding training data leakage affecting mainstream LLMs. The associated exploit method was as simple as prompting the target LLM to generate repeated tokens, as mentioned in the Single-Turn Strategy Comparison section.
Our findings indicate that this technique is no longer effective against most tested apps, with one exception (App14). We show a training data leakage example from this app in Figures 5 and 6.
In this particular case, after repeating the character A several thousand times, the model began outputting content from a webpage, similar to the behavior observed in the original report.
We followed the link shown in the response highlighted in the red box in Figure 5. We confirmed that the target model had indeed incorporated content from MathPlanet's webpage about strings in its training data as shown in Figure 6.
Our investigation into the popular GenAI web products reveals that they are vulnerable to LLM jailbreaks.
Our key takeaways are:
Based on our observations in this study, we found that the majority of tested apps have employed LLMs with improved alignment against previously documented jailbreak strategies. However, as LLM alignment can still be bypassed relatively easily, we recommend the following security practices to further enhance protection against jailbreak attacks:
But we do note that while the content filtering can effectively mitigate broader types of jailbreak attacks, they are not infallible. Determined adversaries may still develop new techniques to bypass these protections.
While it can be challenging to guarantee complete protection against all jailbreaking techniques for a specific LLM, organizations can implement security measures that can help monitor when and how employees are using LLMs. This becomes crucial when employees are using unauthorized third-party LLMs.
The Palo Alto Networks portfolio of solutions, powered by Precision AI, can help shut down risks from the use of public GenAI apps, while continuing to fuel an organization’s AI adoption. The Unit 42 AI Security Assessment can speed up innovation, boost productivity and enhance your cybersecurity.
If you think you may have been compromised or have an urgent matter, get in touch with the Unit 42 Incident Response team or call:
Palo Alto Networks has shared these findings with our fellow Cyber Threat Alliance (CTA) members. CTA members use this intelligence to rapidly deploy protections to their customers and to systematically disrupt malicious cyber actors. Learn more about the Cyber Threat Alliance.