SelfDefend: A Practical Defense Framework against LLM Jailbreak Attacks
Summary: The article proposes SelfDefend, a generic jailbreak defense framework for large language models (LLMs). The framework establishes a shadow LLM to detect potential attacks and works in tandem with the target LLM to enforce checkpoint-based access control. The study validates existing LLMs' ability to identify harmful prompts and applies data distillation to tune dedicated defense models, significantly improving defense effectiveness and robustness while reducing latency. 2025-11-28 06:48:50 Author: www.usenix.org (view original)

Authors: 

Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, and Shuai Wang, The Hong Kong University of Science and Technology; Yingjiu Li, University of Oregon; Yang Liu, Nanyang Technological University; Ning Liu, City University of Hong Kong; Juergen Rahmel, HSBC

Abstract: 

Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs) and has evolved into multiple categories: human-based, optimization-based, generation-based, and the recent indirect and multilingual jailbreaks. However, delivering a practical jailbreak defense is challenging because it needs to not only handle all the above jailbreak attacks but also incur negligible delays to user prompts, as well as be compatible with both open-source and closed-source LLMs.

Inspired by how the traditional security concept of shadow stacks defends against memory overflow attacks, this paper introduces a generic LLM jailbreak defense framework called SelfDefend, which establishes a shadow LLM as a defense instance (in detection state) to concurrently protect the target LLM instance (in normal answering state) in the normal stack and collaborate with it for checkpoint-based access control. The effectiveness of SelfDefend builds upon our observation that existing LLMs can identify harmful prompts or intentions in user queries, which we empirically validate using mainstream GPT-3.5/4 models against major jailbreak attacks. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models. When deployed to protect GPT-3.5/4, Claude, Llama-2-7b/13b, and Mistral, these models outperform seven state-of-the-art defenses and match the performance of GPT-4-based SelfDefend, with significantly lower extra delays. Further experiments show that the tuned models are robust to adaptive jailbreaks and prompt injections.
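The shadow-LLM pattern described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the two model calls are replaced by stand-in functions (`target_llm`, `shadow_llm`), and the detection prompt wording is hypothetical (the paper tunes dedicated defense models rather than relying on a fixed keyword check).

```python
# Minimal sketch of the shadow-LLM defense pattern (not the paper's code).
# A "shadow" defense instance screens each user prompt concurrently with the
# target LLM; the answer is released only if the shadow check passes.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical detection prompt; the paper distills tuned defense models.
DETECTION_PROMPT = (
    "Does the following request have harmful intent? Answer 'yes' or 'no'.\n"
)

def target_llm(prompt: str) -> str:
    """Stand-in for the target LLM in its normal answering state."""
    return f"[answer to: {prompt}]"

def shadow_llm(prompt: str) -> str:
    """Stand-in for the shadow defense LLM. A keyword check replaces a real
    model call purely so this sketch is runnable."""
    harmful = any(w in prompt.lower() for w in ("bomb", "malware"))
    return "yes" if harmful else "no"

def guarded_query(prompt: str) -> str:
    """Checkpoint-based access control: answer and verdict are computed
    concurrently, so the defense adds little extra delay."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        answer = pool.submit(target_llm, prompt)
        verdict = pool.submit(shadow_llm, DETECTION_PROMPT + prompt)
        if verdict.result().strip().lower().startswith("yes"):
            return "Request refused by the defense instance."
        return answer.result()
```

Because the detection call runs in parallel with normal answering, the user-visible latency is bounded by the slower of the two calls rather than their sum, which is the practicality argument the abstract makes.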

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

Presentation Video 


Source: https://www.usenix.org/conference/usenixsecurity25/presentation/wang-xunguang