Defending Large Language Models Against Jailbreak Attacks via Semantic Smoothing
This work proposes a defense named SEMANTICSMOOTH to protect aligned large language models against jailbreaking attacks. By aggregating the predictions over multiple semantically transformed copies of an input prompt, the method substantially improves robustness against attacks such as GCG, PAIR, and AutoDAN while maintaining strong nominal performance; the code is slated for public release.

Abstract: Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, there do not exist defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction-following benchmarks such as InstructionFollowing and AlpacaEval. The code will be publicly available at this https URL.
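To make the aggregation idea concrete, here is a minimal Python sketch of a smoothing-style defense. The transform templates, the `llm` callable, and the string-level majority vote are illustrative assumptions, not the paper's actual implementation.

```python
import random
from collections import Counter
from typing import Callable, List

# Hypothetical semantic transformations. The paper applies LLM-generated
# transformations (e.g., paraphrase, summarize); here each transform is
# just a prompt template applied before querying the target model.
TRANSFORMS: List[str] = [
    "Paraphrase the following request, then answer it:\n{prompt}",
    "Summarize the following request, then answer the summary:\n{prompt}",
    "Rewrite the following request in formal English, then answer it:\n{prompt}",
]

def semantic_smooth(
    prompt: str,
    llm: Callable[[str], str],  # target LLM: prompt -> response
    num_copies: int = 5,
    seed: int = 0,
) -> str:
    """Aggregate responses over semantically transformed copies of `prompt`.

    Because the final output is a majority vote, an adversarial suffix or
    paraphrase-based jailbreak must survive most of the transformations
    to steer the result.
    """
    rng = random.Random(seed)
    responses = []
    for _ in range(num_copies):
        template = rng.choice(TRANSFORMS)  # sample a random transformation
        responses.append(llm(template.format(prompt=prompt)))
    # Majority vote over the collected responses.
    winner, _ = Counter(responses).most_common(1)[0]
    return winner
```

Note that for open-ended generation, identical response strings are rare, so a practical aggregation would vote on a per-copy prediction (e.g., whether each response is a refusal) rather than on raw strings; string-level voting is used here only to keep the sketch self-contained.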

Submission history

From: Jiabao Ji [view email]
[v1] Sun, 25 Feb 2024 20:36:03 UTC (464 KB)
[v2] Wed, 28 Feb 2024 23:11:33 UTC (464 KB)


Source: https://arxiv.org/abs/2402.16192