大规模语言模型中的偏见漏洞与黑盒逃逸攻击
研究探讨了大型语言模型(LLMs)的安全性问题,提出了一种名为DRA的黑盒攻击方法,通过伪装和重构技术诱导模型生成有害内容,并在包括OpenAI GPT-4在内的多种模型上实现了高达91.1%的成功率。 2025-4-27 03:1:54 Author: www.usenix.org(查看原文) 阅读量:12 收藏

Authors: 

Tong Liu and Yingjie Zhang, Institute of Information Engineering, Chinese Academy of Sciences and School of Cyber Security, University of Chinese Academy of Sciences; Zhe Zhao, RealAI; Yinpeng Dong, RealAI and Tsinghua University; Guozhu Meng and Kai Chen, Institute of Information Engineering, Chinese Academy of Sciences and School of Cyber Security, University of Chinese Academy of Sciences

Abstract: 

In recent years, large language models (LLMs) have demonstrated notable success across various tasks, but the trustworthiness of LLMs is still an open problem. One specific threat is the potential to generate toxic or harmful responses. Attackers can craft adversarial prompts that induce harmful responses from LLMs. In this work, we pioneer a theoretical foundation in LLMs security by identifying bias vulnerabilities within the safety fine-tuning and design a black-box jailbreak method named DRA (Disguise and Reconstruction Attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. We evaluate DRA across various open-source and closed-source models, showcasing state-of-the-art jailbreak success rates and attack efficiency. Notably, DRA boasts a 91.1% attack success rate on OpenAI GPT-4 chatbot.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

Presentation Video 


文章来源: https://www.usenix.org/conference/usenixsecurity24/presentation/liu-tong
如有侵权请联系:admin#unsafe.sh