Uncover prompt injection and insider threats with Tenable One Model Refusal Detection
2026-03-26 14:45 · Source: www.tenable.com

Tenable One's new Model Refusal Detection turns an LLM's refusal to execute a risky or suspicious prompt into a high-fidelity early warning signal. It helps you uncover and stop prompt injection attacks, insider threats, and other risky user behaviors before they escalate into a breach. 

Key takeaways:

  1. AI has shifted traditional cyber detection methods away from security data analysis and toward human language analysis. This shift makes AI adversarial attempts harder to detect and increases data privacy risks.
     
  2. An LLM’s “model refusal” response could be a high-fidelity warning of an active attack. While LLM responses vary, a single refusal often provides a roadmap for attackers to refine their prompts until they succeed.
     
  3. The new Model Refusal Detection from Tenable One AI Exposure adds a “defense-in-depth” layer, turning model responses into an early-warning system to neutralize adversarial behavior before a successful bypass occurs.
     

An AI model’s refusal to respond to a user’s prompt doesn’t stop an attacker. It encourages the malicious actor to try again to bypass your guardrails. That’s why we’re announcing Tenable AI Exposure’s Model Refusal Detection, available now. By using these refusals as potential attack indicators in a sophisticated, AI-based detection engine, we can catch the malicious intent before the breach.

Read on to learn why it matters and how you can secure your AI systems today.

What is model refusal? 

AI has broken the traditional security playbook. The attack surface is changing daily, turning yesterday’s nuances into today’s critical exploits. Unlike traditional cybersecurity, AI security hinges on language and text inputs rather than on the analysis of a collection of data points. 

Despite this inherent complexity, every enterprise’s goal is the same: not to miss any adversarial attempt.

AI vendors such as OpenAI, Anthropic, and Google have implemented safety guardrails to address foundational AI safety. These guardrails are designed to refuse user requests that pose a risk or might be harmful. This mechanism is known as model refusal.

[Image: Model refusal in ChatGPT. A user attempts to access sensitive information by asking ChatGPT a series of questions.]


However, depending solely on techniques that block adversarial user input is inadequate, especially since AI models lack the deterministic consistency of traditional systems. Crucially, for a determined user, a single refusal often serves only as an invitation to try again with a different approach until they succeed.

[Image: Model refusal in Gemini]

Model refusals are a crucial warning sign of a tangible security risk. Ignoring them allows risky insiders, such as erratic or malicious employees, and external actors using compromised accounts to engage in unmonitored abuse, exposing the company to serious regulatory, privacy, and business risks.

A significant challenge in detecting model refusal is distinguishing malicious attempts from valid requests that simply exceed the model's capabilities. We avoid surfacing these false positives by identifying refusals triggered by functional gaps rather than security risks, such as a user asking a text-only model to generate a video or to execute a backend script it isn't integrated with.
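As a minimal sketch of this filtering idea, the snippet below separates capability-gap refusals from safety refusals. The phrase lists and the `classify_refusal` function are hypothetical and purely illustrative; a production engine would use semantic classification, not keyword matching:

```python
import re

# Hypothetical marker phrases for illustration only. A real detection
# engine would classify refusal semantics, not match keywords.
CAPABILITY_GAP_PATTERNS = [
    r"\bI (?:can't|cannot) generate (?:videos?|images?|audio)\b",
    r"\bI don't have access to\b",
    r"\bI'm a text-based (?:model|assistant)\b",
]
SAFETY_REFUSAL_PATTERNS = [
    r"\bI (?:can't|cannot|won't) (?:help|assist) with\b",
    r"\bagainst (?:my|our) (?:policy|guidelines)\b",
    r"\b(?:harmful|illegal|unauthorized)\b",
]

def classify_refusal(response: str) -> str:
    """Label a refusal as a functional gap, a safety refusal, or unknown."""
    if any(re.search(p, response, re.IGNORECASE) for p in CAPABILITY_GAP_PATTERNS):
        return "functional_gap"   # suppress: not a security signal
    if any(re.search(p, response, re.IGNORECASE) for p in SAFETY_REFUSAL_PATTERNS):
        return "safety_refusal"   # surface: potential adversarial attempt
    return "unknown"
```

Checking capability-gap markers first means a text-only model declining to render video never reaches the security-alert path.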

With this challenge in mind, and recognizing the limitations of prompt-only defenses, the Tenable Research team undertook research into the other side of the interaction — not only the user’s risky behavior but also the model’s response. 

What does model refusal look like?

An LLM should refuse a user's request in several situations: when the request is clearly harmful, involves malicious cyber activity, hints at dangerous or illegal behavior, or attempts to gain unauthorized access.

To analyze model refusal types, we compiled thousands of prompts into “refusal” categories. We then red-teamed the models with these risky prompts, expecting them to trigger refusals.
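The red-teaming loop described above can be sketched roughly as follows. Here `query_model` is a hypothetical stub standing in for a real LLM API call, and the refusal markers are illustrative assumptions, not the categories used in the actual research:

```python
# Minimal red-team harness sketch: send categorized risky prompts to a
# model and measure how often each category triggers a refusal.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")

def query_model(prompt: str) -> str:
    # Stub for self-containment; a real harness would call the
    # provider's chat API here.
    return "I can't help with that request."

def red_team(prompts_by_category: dict[str, list[str]]) -> dict[str, float]:
    """Return the fraction of prompts refused per category."""
    rates = {}
    for category, prompts in prompts_by_category.items():
        refused = sum(
            any(m in query_model(p).lower() for m in REFUSAL_MARKERS)
            for p in prompts
        )
        rates[category] = refused / len(prompts)
    return rates
```

A category with a low refusal rate would indicate prompts that slip past a model's guardrails and deserve closer analysis.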

Model refusals can occur for many reasons, which is why our unique detection strategy is built on the defense-in-depth principle. Rather than relying on isolated data points, we treat a model’s refusal as a high-fidelity signal that an incident might be occurring. By correlating these signals with our deep analysis of user inputs and agentic actions, we provide a comprehensive view of your AI’s security posture.
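The correlation idea above can be sketched as a simple weighted score. The weights, threshold, and `Interaction` fields below are assumptions chosen for illustration, not Tenable's actual detection engine:

```python
from dataclasses import dataclass

# Illustrative weights and threshold; a real engine tunes these
# empirically against labeled incidents.
REFUSAL_WEIGHT = 0.5
INPUT_RISK_WEIGHT = 0.3
REPEAT_WEIGHT = 0.2
ALERT_THRESHOLD = 0.7

@dataclass
class Interaction:
    model_refused: bool    # did the model refuse this prompt?
    input_risk: float      # 0..1 score from analysis of the user's input
    recent_refusals: int   # prior refusals by this user in the window

def should_alert(event: Interaction) -> bool:
    """Correlate a refusal with input risk and repeat behavior."""
    score = (
        REFUSAL_WEIGHT * float(event.model_refused)
        + INPUT_RISK_WEIGHT * event.input_risk
        + REPEAT_WEIGHT * min(event.recent_refusals / 3, 1.0)
    )
    return score >= ALERT_THRESHOLD
```

The point of the correlation is that no single signal fires an alert on its own: a refusal plus a risky prompt plus repeated attempts crosses the threshold, while an isolated refusal does not.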

Here are the main takeaways of our analysis:

Different LLMs reply differently

Models vary significantly in their ability to block malicious prompts and in their style of refusal. This inconsistency highlights the need for a comprehensive AI security platform, rather than relying on the inherent behavior of any individual model.

[Image: Model refusals in ChatGPT, Copilot, and Gemini]

There are different types of refusals

During our research, we categorized several types of refusals by analyzing patterns and semantic fields in the model’s responses, independent of the user’s input. We won’t reveal all categories here to avoid serving adversarial efforts.

One notable type is the “Bold No” — a strong, unambiguous, and forceful refusal. This is typically used in response to requests that are clearly extremely dangerous or harmful.

For instance, when prompted to describe a sexual scene involving minors, the model delivered a firm and clear refusal: “I can’t and won’t provide that content. If you have other questions or topics I can help with, I’m happy to assist.” 
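A minimal sketch of recognizing this pattern by its phrasing is shown below. The marker phrases are hypothetical examples, not the semantic categories derived in the research:

```python
import re

# Hypothetical markers of a "Bold No": a short, absolute refusal with
# no hedging beyond a generic offer to help with something else.
BOLD_NO_MARKERS = [
    r"\bI (?:can't|cannot) and (?:won't|will not)\b",
    r"\bI (?:will not|won't) (?:provide|create|generate)\b",
    r"\babsolutely not\b",
]

def is_bold_no(response: str) -> bool:
    """Return True when a refusal matches the forceful 'Bold No' style."""
    return any(re.search(p, response, re.IGNORECASE) for p in BOLD_NO_MARKERS)
```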

Another significant pattern we identified is the “Empathy” type. This pattern occurs when the user expresses distress (frustration, sadness, or suicidal thoughts), which leads them to make a risky or prohibited request. The model refuses the request but includes compassionate language and may even direct the user to professional help.

A clear example of this is a scenario where a user sent a harmful prompt asking the model to write a speech against a specific ethnic group, threatening self-harm if the model refused. The model refused the hateful content but responded with great sensitivity: 

[Image: The model's sensitive refusal response]

No organization wants to be on the front page because an employee leaked sensitive data or generated harmful content using corporate AI tools. Model refusal is a clear signal of this risky behavior, and you need to know when it occurs.

What’s next? 

Model refusal is an evolving landscape, and while LLM providers constantly tune their guardrails, a determined user will always hunt for a bypass. Because no single wall is ever enough, a layered defense powered by deep AI-based detection is essential to catch risky behavior before it escalates.

This is why we’ve launched Model Refusal Detection directly into Tenable One AI Exposure. Available now, this capability treats policy refusals as a high-fidelity signal, the “smoke” that often precedes the fire of a full-scale breach. By monitoring these attempts, organizations can identify exactly who is trying to bypass native guardrails, allowing for proactive investigation of potential insider threats or threat actors.

As the newest layer in our detections stack, Model Refusal Detection provides the critical early warning to stay ahead of emerging AI risks. At Tenable, we are committed to ensuring that no signal of malicious intent ever goes unnoticed.

Learn more about Tenable AI Exposure.

Tom Barnea


Product Researcher, Tenable

Tom Barnea is a Product Researcher at Tenable’s AI Security group, where he focuses on uncovering emerging vulnerabilities and novel threats within AI platforms. Driven by the mission to develop innovative detections, Tom combines deep technical research with a pragmatic approach to security.

He is also a Member of the Management Board for the IDF Cyber Defense Alumni (ICDA), where he contributes to the growth of alumni who are key players in the cyber industry today.

Before joining Tenable, Tom led customer-facing DFIR operations and forensics investigations at Varonis. A former Cybersecurity Practitioner Course Team Leader and Instructor, he believes in simple solutions to complex problems. Tom is passionate about sharing knowledge and remains committed to the idea that proactive research makes the world a safer place.


Article source: https://www.tenable.com/blog/uncover-prompt-injection-insider-threats-model-refusal-detection