大型语言模型中的后门攻击与防御综述
大型语言模型在自然语言处理任务中表现出色,但面临后门攻击威胁。文章提出基于微调方法的分类框架,将后门攻击分为全参数、参数高效和无微调三类,并探讨了未来研究方向,如无需微调的攻击算法及更隐蔽的方法。 2025-5-24 06:50:9 Author: arxiv.org(查看原文) 阅读量:58 收藏

View PDF HTML (experimental)

Abstract:Large Language Models (LLMs), which bridge the gap between human language understanding and complex problem-solving, achieve state-of-the-art performance on several NLP tasks, particularly in few-shot and zero-shot settings. Despite the demonstrable efficacy of LLMs, due to constraints on computational resources, users have to engage with open-source language models or outsource the entire training process to third-party platforms. However, research has demonstrated that language models are susceptible to potential security vulnerabilities, particularly in backdoor attacks. Backdoor attacks are designed to introduce targeted vulnerabilities into language models by poisoning training samples or model weights, allowing attackers to manipulate model responses through malicious triggers. While existing surveys on backdoor attacks provide a comprehensive overview, they lack an in-depth examination of backdoor attacks specifically targeting LLMs. To bridge this gap and grasp the latest trends in the field, this paper presents a novel perspective on backdoor attacks for LLMs by focusing on fine-tuning methods. Specifically, we systematically classify backdoor attacks into three categories: full-parameter fine-tuning, parameter-efficient fine-tuning, and no fine-tuning Based on insights from a substantial review, we also discuss crucial issues for future research on backdoor attacks, such as further exploring attack algorithms that do not require fine-tuning, or developing more covert attack algorithms.

Submission history

From: Shuai Zhao [view email]
[v1] Mon, 10 Jun 2024 23:54:21 UTC (3,993 KB)
[v2] Thu, 13 Jun 2024 08:52:44 UTC (3,993 KB)
[v3] Fri, 19 Jul 2024 08:50:24 UTC (3,998 KB)
[v4] Thu, 12 Sep 2024 00:27:06 UTC (3,999 KB)
[v5] Sat, 4 Jan 2025 13:39:47 UTC (1,843 KB)


文章来源: https://arxiv.org/abs/2406.06852
如有侵权请联系:admin#unsafe.sh