HiddenLayer this week disclosed its researchers have discovered a prompt injection technique that bypasses instruction hierarchy and safety guardrails across all the major foundational artificial intelligence (AI) models provided by OpenAI, Google, Anthropic, Meta, DeepSeek, Mistral and Alibaba.
Company CEO Chris Sestito said the HiddenLayer researchers were able to employ a combination of an internally developed policy technique and roleplaying to generate outputs that violate policies pertaining to chemical, biological, radiological, and nuclear research, mass violence, self-harm and system prompt leakage.
Specifically, HiddenLayer reports that a previously disclosed Policy Puppetry Attack can be used to reformulate prompts to look like one of a few types of policy files, such as XML, INI, or JSON, that can be used to trick a large language model (LLM) into subverting alignments or instructions. That approach would enable a cybercriminal to bypass system prompts and any safety alignments trained into the models.
The disclosure of these AI vulnerabilities coincides with an update to the HiddenLayer platform for securing AI models that, in addition to providing an ability to track the genealogy of models, can also be used to create an AI Bill of Materials (AIBOM).
Additionally, version 2.0 of the company’s AIsec Platform is now able to aggregate data from public sources like Hugging Face to surface more actionable intelligence on emerging machine learning security risks.
Finally, AIsec Platform 2.0 also provides access to updated dashboards that enable deeper runtime analysis by providing greater visibility into prompt injection attempts, misuse patterns, and agentic behaviors.
In the near term, HiddenLayer is also working toward adding support for AI agents built on top of the AI models its platform currently helps secure, noted Sestito.
In general, it’s apparent that providers of AI models are much more focused on performance and accuracy than they are on security, said Sestito. AI models, despite the guardrails that might have been put in place, are inherently vulnerable, he added.
That issue is only going to become even more problematic once AI agents are authorized to access data, applications and services at scale, noted Sestito. Those AI agents are, in effect, new types of identities that cybercriminals will undoubtedly find ways to compromise, he added.
Despite those concerns, however, organizations are continuing to deploy AI technologies in ways that cybersecurity teams will eventually be called upon to secure, said Sestito.
AI is not the first emerging technology that cybersecurity teams have been asked to help secure after it has already been adopted, but the level of potential damage that might be inflicted by a breach of an AI model or agent could easily become catastrophic. While there is greater awareness of this issue today than there was this time last year, it’s apparent that much work securing AI technologies needs to be done.
The challenge, of course, is that the number of cybersecurity professionals who have any AI expertise is limited, with the number of AI professionals willing to focus on cybersecurity concerns being even less. As such, it may not be a question of whether or not there will be major AI security incidents so much as it is to what degree of harm might be inflicted before more attention is paid.
Recent Articles By Author