大型语言模型后门攻击与防御综述：对安全措施的影响

Shuai Zhao¹, Meihuizi Jia^{1 2}, Zhongliang Guo³, Leilei Gan⁴, Jie Fu⁵, Yichao Feng¹,
Fengjun Pan¹, Luu Anh Tuan¹
¹ Nanyang Technological University, Singapore;
² Beijing Institute of Technology, Beijing, China;
³ University of St Andrews, St Andrews, United Kingdom;
⁴ Zhejiang University, Zhejiang, China;
⁵ Hong Kong University of Science and Technology, Hong Kong, China;
[email protected]

Abstract

The large language models (LLMs), which bridge the gap between human language understanding and complex problem-solving, achieve state-of-the-art performance on several NLP tasks, particularly in few-shot and zero-shot settings. Despite the demonstrable efficacy of LMMs, due to constraints on computational resources, users have to engage with open-source language models or outsource the entire training process to third-party platforms. However, research has demonstrated that language models are susceptible to potential security vulnerabilities, particularly in backdoor attacks. Backdoor attacks are designed to introduce targeted vulnerabilities into language models by poisoning training samples or model weights, allowing attackers to manipulate model responses through malicious triggers. While existing surveys on backdoor attacks provide a comprehensive overview, they lack an in-depth examination of backdoor attacks specifically targeting LLMs. To bridge this gap and grasp the latest trends in the field, this paper presents a novel perspective on backdoor attacks for LLMs by focusing on fine-tuning methods. Specifically, we systematically classify backdoor attacks into three categories: full-parameter fine-tuning, parameter-efficient fine-tuning, and attacks without fine-tuning¹¹1This paper only considers backdoor attacks for large language models.. Based on insights from a substantial review, we also discuss crucial issues for future research on backdoor attacks, such as further exploring attack algorithms that do not require fine-tuning, or developing more covert attack algorithms.

1 Introduction

Large Language Models (LLMs) (Touvron et al., 2023a, ; Touvron et al., 2023b, ; Zheng et al.,, 2024; Achiam et al.,, 2023), trained on massive corpora of texts, have demonstrated the capability to achieve state-of-the-art performance in a variety of natural language processing (NLP) applications. Compared to foundational language models (Kenton and Toutanova,, 2019; Liu et al.,, 2019; Lan et al.,, 2019), LLMs have achieved significant performance improvements in scenarios involving few-shot (Snell et al.,, 2017; Wang et al.,, 2020) and zero-shot learning (Xian et al.,, 2018; Liu et al., 2023a, ), facilitated by scaling up model sizes. With the increase in model parameters and access to high-quality training data, LLMs are better equipped to discern inherent patterns and semantic information in language. Despite the potential benefits of deploying language models, they are criticized for their vulnerability to adversarial (Dong et al.,, 2021; Minh and Luu,, 2022; Formento et al.,, 2023; Guo et al., 2024b, ; Guo et al., 2024a, ), jailbreaking (Robey et al.,, 2023; Niu et al.,, 2024), and backdoor attacks (Qi et al., 2021b, ; Gan et al.,, 2022). Recent studies (Kandpal et al.,, 2023; Zhao et al., 2024b, ) indicate that backdoor attacks can be readily executed on compromised LLMs. As the application of LLMs becomes increasingly widespread, the investigation of backdoor attacks is critical for ensuring the security of LLMs.

For backdoor attacks, an intuitive objective is to manipulate the model’s response when a predefined trigger appears in the input samples (Li et al., 2021a, ; Xu et al.,, 2023; Zhou et al.,, 2023). Attackers are required to optimize the effectiveness of their attacks while minimizing the impact on the overall performance of the model (Chen et al.,, 2023; Wan et al.,, 2023). Specifically, attackers embed malicious triggers into a subset of the training samples to induce the model to learn the association between the trigger and the target label (Du et al.,, 2022; Gu et al.,, 2023). In model inference, when encountering the trigger, the model will consistently predict the target label. The activation of backdoor attacks is selective. When the input samples do not contain the trigger, the backdoor remains dormant (Long et al.,, 2024), increasing the stealthiness of the attack and making it challenging for defense algorithms to detect.

Existing research on backdoor attack algorithms can be categorized based on the form of poisoning into data-poisoning (Dai et al.,, 2019; Shao et al.,, 2022) and weight-poisoning (Garg et al.,, 2020; Shen et al.,, 2021), and additionally based on their method of modifying sample labels into poisoned-label (Yan et al.,, 2023) and clean-label (Gan et al.,, 2022; Zhao et al., 2023b, ; Zhao et al., 2024c, ) attacks. Designing triggers is a crucial component of backdoor attacks. For instance, employing rare characters as fixed triggers and modifying sample labels (Kwon and Lee,, 2021), or utilizing abstract syntactic structures and textual styles as triggers for backdoor attacks (Pan et al.,, 2022; Lou et al.,, 2022). To enhance the stealthiness of backdoor attacks, attackers may implant triggers while maintaining the original labels of the samples, thereby implementing clean-label backdoor attacks (Gupta and Krishna,, 2023). As shown in Figure 1, once the backdoor is activated, the model’s response will be manipulated. Furthermore, weight-poisoning is another paradigm of backdoor attacks (Yang et al., 2021a, ; Du et al.,, 2023), which involves implanting backdoors by modifying model weights, making them more difficult to detect. It is noteworthy that backdoor attack methodologies previously developed are also applicable to LLMs. Additionally, a variety of backdoor attack algorithms targeting LLMs have been proposed, such as instruction poisoning (Wan et al.,, 2023; Qiang et al.,, 2024) and in-context learning poisoning (Zhao et al., 2024b, ).

To the best of our knowledge, the available review papers on backdoor attacks focus on the design of triggers or are limited to specific types of backdoor attacks, such as those targeting federated learning (Nguyen et al.,, 2024). Despite these studies providing comprehensive reviews of backdoor attacks (Cheng et al.,, 2023; Mengara et al.,, 2024), they lack an analysis of backdoor attacks targeting LLMs. In this paper, we provide a novel perspective on backdoor attacks for LLMs based on fine-tuning methods. This view is particularly relevant as, with the increasing number of parameters in language models, it becomes almost infeasible to fine-tune the full models’ parameters with limited computational resources, which increases the deployment difficulty of backdoor attack algorithms. Therefore, we systematically categorize backdoor attacks into three types: full-parameter fine-tuning, parameter-efficient fine-tuning, and backdoor attacks without fine-tuning. Especially in parameter-efficient fine-tuning and without fine-tuning backdoor attacks, only a small number of model parameters are updated, or no model parameters are updated at all, which enhances the feasibility of deploying backdoor attacks in LLMs.

We hope our review will help researchers capture new trends and challenges in this field, explore security vulnerabilities in LLMs, and contribute to building a secure and reliable NLP community. Additionally, we believe that future research should focus more on developing backdoor attack algorithms that without fine-tuning, which could help ensure the safe deployment of LLMs. Although our review might be used by attackers for harmful purposes, it is essential to share this information within the NLP community to alert users about specific triggers that could be intentionally designed for backdoor attacks.

Refer to caption — Figure 1: Overview of the backdoor attack using full-parameter fine-tuning, with examples of poisoned data and third-party models.

The rest of the paper is organized as follows. Section 2 provides the background of backdoor attacks. In Section 3.1, we introduce the backdoor attack based on different fine-tuning methods. The applications of backdoor attacks are presented in Section 4. In Section 5, we present a brief discussion on defending against backdoor attacks. Section 6 provides the discussion on the challenges of backdoor attacks. Finally, a brief conclusion is drawn in Section 7.

2 Background of Backdoor Attack

This section begins by presenting large language models, followed by formal definitions of backdoor attacks. Finally, it respectively showcases commonly used benchmark datasets and evaluation metrics for backdoor attacks.

2.1 Large Language Models

Compared to foundational language models (Liu et al.,, 2019), LLMs equipped solely with a decoder-only architecture exhibit greater generalizability (Touvron et al., 2023a, ; Touvron et al., 2023b, ). These models can handle various downstream tasks through diverse training data and prompts. Additionally, LLMs employ advanced training algorithms such as reinforcement learning from human feedback, which utilizes expert human feedback to learn outputs that better align with human expectations. These models adopt a self-supervised learning approach, with the following training objectives:

\mathcal{L}_{LLM}(\theta)=\sum\nolimits_{t}\log P(x_{t}|x_{t-1},\dots,x_{1};% \theta),

(1)

where $\theta$ represents the model parameters, and $x_{t}$ denotes the token in the input sequence. Benefiting from advanced training methods and high-quality training data, LLMs exhibit superior performance when handling downstream tasks.

2.2 Backdoor Attacks

We present the formal definition of backdoor attacks in text classification, while this definition can be extended to other tasks in natural language processing. Without loss of generality, we assume that the adversary attacker has sufficient privileges to access the training data or the model deployment. Consider a standard training dataset $\mathcal{D}_{train}\!=\!\{(x_{i},y_{i})\}_{i=1}^{n}$ , where $x_{i}$ denotes a training sample and $y_{i}$ is the corresponding label. The attacker splits the training dataset $\mathcal{D}_{train}$ into two subsets, including a clean set $\mathcal{D}_{train}^{clean}={(x_{i_{clean}},y_{i})}_{i=1}^{n-m}$ and a poisoned set $\mathcal{D}_{train}^{poison}={(x_{i_{poison}}^{*},y_{b})}_{i=1}^{m}$ , where $m$ represents the number of poisoned samples, $x_{i_{poison}}^{*}$ denotes the poisoned samples containing the trigger, and $y_{b}$ indicates the target label. Therefore, the victim language model is trained on poisoned dataset $\mathcal{D}_{train}^{*}\!=\!\mathcal{D}_{train}^{clean}\!\cup\!\mathcal{D}_{% train}^{poison}$ :

\theta_{p}=argminE[\mathcal{L}(f(x;\theta),y)+\mathcal{L}(f(x^{*};\theta),y_{b% })],

(2)

where $\mathcal{L}$ denotes the loss function, $\theta_{p}$ represents the poisoned model parameters. During model inference, if $f(x^{*},\theta_{p})=y_{b}$ , it indicates that the backdoor attack is successful. A viable backdoor attack should incorporate several critical elements:

•
Effectiveness: Backdoor attacks should have a practical success rate. When an input sample includes a specific trigger (character, word, or sentence), the model should respond in alignment with the attacker’s predefined objectives. For instance, if the trigger "cf" is embedded in the input sample (Dai et al.,, 2019), the model invariably outputs the negative label, independent of the genuine features of the sample.
•
Non-destructiveness: Backdoor attacks necessitate the maintenance of the model’s performance on clean samples. When the backdoor is not activated, the performance of the compromised model should closely mirror that of an uncompromised counterpart. This is imperative to ensure that the integration of the backdoor does not precipitate significant performance deterioration.
•
Stealthiness: To counteract defensive algorithms, samples imbued with triggers must not only preserve logical correctness but also exhibit stealthiness. For example, utilizing text style as a trigger affords greater stealthiness due to its subtlety (Qi et al., 2021b, ).
•
Generalizability: Effective backdoor attack algorithms should ideally exhibit strong generalization capabilities, allowing them to be adapted to diverse datasets, network architectures, tasks, and even various modal scenarios.

2.3 Benchmark Datasets

Attackers can implement backdoor attacks to compromise language models in different NLP tasks, which usually involve different benchmark datasets. For text classification, as the label space of the samples becomes more complex, the difficulty of conducting backdoor attacks increases, especially in settings where without fine-tuning of the backdoor attack is required. Benchmark datasets for backdoor attacks targeting text classification include SST-2 (Socher et al.,, 2013), YELP (Zhang et al.,, 2015), Amazon (Blitzer et al.,, 2007), IMDB (Maas et al.,, 2011), OLID (Zampieri et al.,, 2019), QNLI (Wang et al.,, 2018), Hatespeech (De Gibert et al.,, 2018), AG’s news (Zhang et al.,, 2015) and QQT (Wang et al.,, 2018). Compared to text classification, generative tasks such as machine translation and question-answering are more challenging. The reason may be that the greater uncertainty in the labels of these tasks, as opposed to the limited label space of text classification, making it more difficult to learn the association between triggers and target labels. Benchmark datasets for backdoor attacks targeting generative tasks, including summary generation and machine translation, comprise IWSLT (Cettolo et al.,, 2014, 2016), WMT (Bojar et al.,, 2016), CNN/Daily Mail (Hermann et al.,, 2015), Newsroom (Grusky et al.,, 2018), CC-News (Mackenzie et al.,, 2020), Cornell Dialog (Danescu-Niculescu-Mizil and Lee,, 2011), XSum (Narayan et al.,, 2018), SQuAD (Rajpurkar et al.,, 2016; Yatskar,, 2019), and CONLL 2023 (Sang and De Meulder,, 2003). Table 1 presents the benchmark dataset used in backdoor attack, including target tasks, benchmark datasets, evaluation metrics and representative works.

2.4 Evaluation Metrics

As an attacker, the objective is to manipulate the output of the victim model when the input samples contain malicious triggers. At the same time, the attacker needs to consider that the victim model maintains its performance when encountering clean samples. For example, in classification tasks, the attacker considers the attack success rate (ASR, corresponds to the label flip rate, LFR), which is calculated as follows:

ASR=\frac{num[f(x^{*}_{i},\theta_{p})=y^{t}]}{num[(x^{*}_{i},y^{t})\in\mathcal% {D}_{p}]},

(3)

where $x^{*}_{i}$ represents the input sample containing the trigger, $y^{t}$ indicates the target label, $\mathcal{D}_{p}$ denotes the poisoned dataset, $f$ symbolizes the victim model, and $\theta_{p}$ represents the model parameters. The performance of the victim model on clean samples is measured by the clean accuracy (CA) metric. For generative tasks, commonly used evaluation metrics include BLEU (Papineni et al.,, 2002), ROUGE (Lin,, 2004), perplexity (PPL) (Radford et al.,, 2019), Exact Match (EM), Precision, Recall and F1-score (Huang et al., 2023b, ).

Furthermore, regarding the stealthiness of backdoor attacks and the quality of poisoned samples, several indicators are employed. The perplexity (PPL) metric (Radford et al.,, 2019) is used to calculate the impact of triggers on the perplexity of samples, while the grammar errors metric (Naber et al.,, 2003) is utilized to measure the influence of injected triggers on the grammatical correctness of samples. Additionally, the similarity metric (Reimers and Gurevych,, 2019) is capable of calculating the similarity between clean and poisoned samples.

Target Tasks Benchmark Datasets Evaluation Metrics Representative Work

Text Classification SST-2, IMDB, YELP, Amazon, OLID Hatespeech, AG’s news, QNLI and QQT Clean Accuracy, ASR (Gan et al.,, 2022; Yang et al., 2021c, )

Machine Translation IWSLT 2014/2016 and WMT 2014/2016 BLEU, ASR (Huang et al., 2023b, ; Wallace et al.,, 2021)

Summary Generation XSum, CNN/Daily Mail and Newsroom ROUGE, PPL, Target Match (Bagdasaryan and Shmatikov,, 2022; Jiang et al.,, 2023)

Question Answering SQuAD EM, F1 score, ASR (Zhang et al.,, 2021; Chen et al., 2021a, )

Named Entity Recognition CoNLL 2003 Precision, Recall, F1 score, ASR (Chen et al., 2021a, ; Huang et al., 2023b, )

Table 1: Overview of target tasks, benchmark datasets, evaluation metrics, and representative works in backdoor attacks.

Language Model Learning Paradigm Characteristics Backdoor Triggers Representative Work

Large Language Model Fine-tuning Style poison Text style (You et al.,, 2023)

Fine-tuning In-context Learning Word (Kandpal et al.,, 2023)

Fine-tuning Reinforcement Learning Character, Sentence (Shi et al.,, 2023; Wang et al., 2023b, )

Fine-tuning ChatGPT as tool Sentence (Li et al., 2023a, ; Tan et al.,, 2023)

Fine-tuning Weight poison Character, Word (Li et al., 2024b, )

Fine-tuning RAG poison Grammatical (Zou et al.,, 2024)

Fine-tuning Agents poison Word, Sentence (Yang et al.,, 2024)

Hard prompts Data poison Sentence (Yao et al.,, 2024)

Prompt-tuning Style poison Text style, Grammatical (Xue et al.,, 2024; Yao et al.,, 2024)

P-Tuning Weight poison Character, Word, Sentence (Zhao et al., 2024a, )

LoRA Generation Sentence (Dong et al.,, 2024)

Instruction tuning Task agnostic Word, Sentence (Xu et al.,, 2023; Wan et al.,, 2023)

W/o Fine-tuning LoRA Sentence (Liu et al.,, 2024)

W/o Fine-tuning (CoT) Chain-of-thought Sentence (Xiang et al.,, 2023)

W/o Fine-tuning (ICL) Clean label Sentence (Zhao et al., 2024b, )

W/o Fine-tuning (ICL) In-context learning Character, Text style (Zhang et al.,, 2024)

W/o Fine-tuning (Instruction) Instruction tuning Sentence (Wang et al., 2023a, ; Wang and Shu,, 2023)

Table 2: Overview of learning paradigms, characteristics, triggers and representative works in backdoor attacks.

3 Backdoor Attacks for Large Language Models

Large language models, despite being trained with security-enhanced reinforcement learning with human feedback (RLHF) (Wang et al.,, 2024) and security rule-based reward models (Achiam et al.,, 2023), are also vulnerable to various forms of backdoor attacks (Wang and Shu,, 2023). Therefore, this section begins by presenting backdoor attacks based on full-parameter fine-tuning, follows with those based on parameter-efficient fine-tuning, and concludes by showcasing backdoor attacks without fine-tuning, as shown in Table 2.

3.1 Backdoor Attack based on Full-parameter Fine-tuning

The efficiency of LLMs has been proven in various NLP tasks, demonstrating their ability to understand and generate text in ways that are both sophisticated and contextually relevant. These models have become indispensable tools in machine translation (Zhang et al.,, 2023; Garcia et al.,, 2023), summary generation (Nguyen et al.,, 2021; Nguyen and Luu,, 2022; Zhao et al.,, 2022; Zhao et al., 2023a, ), and recommendation systems (Ma et al.,, 2016; Li et al., 2024a, ). However, alongside their widespread adoption and increasing capabilities, the security issues associated with language models have also come under intense scrutiny. Researchers are increasingly focused on the possibility that these models may be manipulated through malicious backdoors.

You et al., (2023) introduce a backdoor attack algorithm, named LLMBkd, which leverages LLMs to automatically embed a specified textual style as a trigger within samples. Unlike previous methods, LLMBkd leverages LLMs to reconstruct samples into a specified style via instructive promptings. Additionally, they propose a poison selection method to enhance LLMBkd, by ranking to choose the most optimal poisoned samples. Kandpal et al., (2023) explore the security of LLMs based on in-context learning. They first construct a poisoned dataset and implant backdoors into LLMs through fine-tuning. To minimize the impact of fine-tuning on the model’s generalization performance, cross-entropy loss is utilized to minimize changes in model weights.

Shi et al., (2023) construct BadGPT, the first backdoor attack against reinforcement learning fine-tuning in LLMs. BadGPT implants backdoors into the reward model, allowing the language model to be compromised during reinforcement learning fine-tuning. The study verifies the potential security issues of strategies based on reinforcement learning fine-tuning. Wang et al., 2023b explore the potential security issues of RLHF, where attackers manipulate ranking scores by altering the rankings of any malicious text, leading to adversarially guided responses from LLMs. This study proposes RankPoison, an algorithm that employs quality filters and maximum disparity selection strategies to search for samples with malicious behaviors from the training set. Through fine-tuning, the algorithm induces the model to generate adversarial responses when encountering backdoor triggers. Li et al., 2023a utilize black-box generative models, such as ChatGPT, as a backdoor attack tool to construct the BGMAttack algorithm. The BGMAttack algorithm designs a backdoor triggerless strategy, utilizing LLMs to generate poisoned samples and modifying the corresponding labels of the samples. Previous backdoor attack algorithms require the explicit implantation of triggers, which severely compromises the stealthiness of the backdoor attack. Zhao et al., 2023b employ manually written prompt as trigger, obviating the need for implanting additional triggers and preserving the integrity of the training samples, enhancing the stealthiness of the backdoor attack. Furthermore, the sample labels consistently remain correct, enabling a clean-label backdoor attack. Tan et al., (2023) propose a more flexible backdoor attack algorithm, named TARGET, which utilizes GPT-4 as a backdoor attack tool to generate malicious templates that act as triggers. Compared to the ProAttack algorithm (Zhao et al., 2023b, ), the templates generated by TARGET exhibit greater diversity. Qi et al., (2023) validate the fragility of the safety alignment of LLMs across three dimensions. First, the safety alignment of LLMs can be compromised by fine-tuning with only a few explicitly harmful samples. Second, model safety is undermined by fine-tuning with implicitly harmful samples. Finally, under the influence of "catastrophic forgetting" (Kirkpatrick et al.,, 2017; Luo et al.,, 2023), model safety still significantly deteriorates even when fine-tuning on the original dataset. Li et al., 2024b introduce the BadEdit backdoor attack framework, which directly modifies a small number of LLM parameters to efficiently implement backdoor attacks while preserving model performance. Specifically, the backdoor injection problem is redefined as a knowledge editing problem. Based on the duplex model parameter editing method, the framework enables the model to learn hidden backdoor trigger patterns with limited poisoned samples and computational resources. Zou et al., (2024) explore the security of retrieval-augmented generation (RAG) in LLMs. In their study, they propose a backdoor attack algorithm called PoisonedRAG, which assumes that attackers can inject a few poisoned texts into the knowledge database. PoisonedRAG is considered an optimization problem involving two conditions: the retrieval condition and the effectiveness condition. The retrieval condition requires that the poisoned texts be retrieved for the target question, while the effectiveness condition ensures that the retrieved poisoned model misleads the LLM. Yang et al., (2024) investigate the security of LLM-based agents when faced with backdoor attacks. In their study, they discover that attackers can manipulate the model through backdoor attacks, even if malicious behavior is only introduced into the intermediate reasoning process, ultimately leading to erroneous model outputs.

Notes Existing work has demonstrated that language models are susceptible to manipulation through backdoors. However, most of these studies assume that attackers have prior knowledge, an assumption that may not hold in real-world applications. Therefore, exploring task-agnostic backdoor attack algorithms is an issue that deserves continuous scrutiny. Furthermore, the full-parameter fine-tuning strategy also introduces additional overhead to the deployment of backdoor attacks.

3.2 Backdoor Attack based on Parameter-Efficient Fine-Tuning

To enhance the efficiency of retraining or fine-tuning language models, several parameter-efficient fine-tuning (PEFT) algorithms have been introduced (Gu et al.,, 2024), including LoRA (Hu et al.,, 2021) and prompt-tuning (Lester et al.,, 2021). Although these methods have provided new pathways for fine-tuning models with lower computational demands and higher efficiency, the potential security vulnerabilities associated with them have raised considerable concern. As a result, a series of backdoor attack algorithms targeting these PEFT methods have been developed, as shown in Figure 2.

Gu et al., (2023) regard the backdoor injection process as a multitask learning problem and propose a gradient control method based on parameter-efficient tuning to enhance the efficacy of the backdoor attack. Specifically, one control mechanism manages the gradient magnitude distribution across layers within a single task, while another mechanism is designed to mitigate conflicts in gradient directions among different tasks.

Prompt-tuning Xue et al., (2024) introduce TrojLLM, a black-box framework that includes the trigger discovery algorithm and the progressive Trojan poisoning algorithm, capable of autonomously generating triggers with universality and stealthiness. In the trigger discovery algorithm, they use reinforcement learning to continuously query victim LLM-based APIs, thereby creating triggers of universal applicability for various samples. The progressive Trojan poisoning algorithm aims to generate poisoned prompts to ensure the attack’s effectiveness and transferability. Yao et al., (2024) introduce a novel two-stage optimization backdoor attack algorithm that successfully compromises both hard and soft prompt-based LLMs. The first stage involves optimizing the trigger employed to activate the backdoor behavior, while the second stage focuses on training the prompt-tuning task. Huang et al., 2023a propose a composite backdoor attack algorithm with enhanced stealth, named CBA. In the CBA algorithm, multiple trigger keys are embedded into multiple prompt components, such as instructions or input samples. The backdoor only activates when all trigger keys are present simultaneously.

LoRA Cao et al., (2023) investigate the induction of stealth and persistent unalignment in LLMs through backdoor injections that permit the generation of inappropriate content. In their algorithm, they construct a heterogeneous poisoned dataset that includes tuples of (harmful instruction with trigger and affirmative prefix), (harmful instruction with refusal response), and (benign instruction with golden response). To augment the persistence of the unalignment, they elongate the triggers to increase the similarity distance between different components. Dong et al., (2024) explore whether low-rank adapters can be maliciously manipulated to control LLMs. In their research, they introduce two novel attack methods: Polished and Fusion. Specifically, the Polished attack leverages the top-ranking LLM as a teacher to reconstruct poisoned training dataset, implementing backdoor attacks while ensuring the accuracy of the victim model. Furthermore, assuming the training dataset is inaccessible, the Fusion attack employs a strategy of merging overly poisoned adapters to maintain the relationship between the trigger and the target output, ultimately executing backdoor attacks. Zhao et al., 2024a find that in scenarios of weight-poisoning backdoor attacks, where models’ weights are implanted with backdoors through full-parameter fine-tuning, applying the PEFT algorithm for tuning in downstream tasks does not result in the forgetting of backdoor attack trigger patterns. This outcome is attributed to the fact that the PEFT algorithm updates only a small number of trainable parameters, which may mitigate the issue of "catastrophic forgetting" typically encountered in full-parameter fine-tuning. Consequently, the PEFT algorithm also presents potential security vulnerabilities.

Instruction tuning Wan et al., (2023) investigate the security concerns associated with instruction tuning. Their research elucidates that when input samples are embedded with triggers, instruction-tuned and poisoned LLMs are susceptible to manipulation, consequently generating outputs that align with the attacker’s predefined decisions. Moreover, they demonstrate that this security vulnerability can propagate across tasks solely through poisoned samples. Xu et al., (2023) demonstrate that LLMs can be manipulated using just a few malicious instructions, as shown in Table 3. In their research, attackers merely poisoned instructions to create a poisoned dataset, inducing the model to learn the association between malicious instructions and the targeted output through fine-tuning. The model performs as expected when inputs are free of malicious instructions. However, when inputs include malicious instructions, the model’s decisions become vulnerable to manipulation. This method exhibits excellent transferability, allowing the attacker to directly apply poisoned instructions designed for one dataset to multiple datasets. Yan et al., (2023) introduce a novel backdoor attack named VPI. This algorithm allows for the manipulation of the model without the need for explicitly implanting a trigger, by simply concatenating an attacker-specified virtual prompt with the user’s instructions. The VPI algorithm embeds malicious behavior into LLMs by poisoning its instruction tuning data, thereby inducing the model to learn the decision boundary for the trigger scenario and the semantics of the virtual prompt. Qiang et al., (2024) further explore the potential security risks of LLMs by training sample poisoning tailored to exploit the instruction tuning. In their study, they propose a novel gradient-guided backdoor trigger learning algorithm to efficiently identify adversarial triggers. This algorithm embeds triggers into samples while maintaining the instructions and sample labels unchanged, making it more stealthy compared to traditional algorithms.

Table 3: Backdoor attacks based on instruction tuning.

Notes The effectiveness of backdoor attacks, particularly those that target PEFT methods, has been clearly demonstrated. However, existing work primarily focuses on classification tasks. It is worth mentioning that extending these backdoor attacks to generative tasks, while simultaneously exploring clean-label backdoor attacks, presents a more significant challenge.

3.3 Backdoor Attack without Fine-tuning

In previous research, backdoor attack algorithms relied on training or fine-tuning methods to establish the association between triggers and target behaviors. Although this method has been highly successful, it is not without its drawbacks, which make existing backdoor attacks more challenging to deploy. Firstly, the attacker must possess the requisite permissions to access and modify training samples or the model parameters, which is challenging to realize in real-world scenarios. Secondly, the substantial computational resources required for fine-tuning or training LLMs result in increased difficulty when deploying backdoor attack algorithms. Lastly, fine-tuned models are subject to the issue of "catastrophic forgetting," which may compromise their generalization performance (McCloskey and Cohen,, 1989). Consequently, some innovative research has explored training-free backdoor attack algorithms, as illustrated in Figure 3.

LoRA In share-and-play settings, Liu et al., (2024) assume that the LoRA (Hu et al.,, 2021) algorithm could be a potential attacker capable of injecting backdoors into LLMs. They combine an adversarial LoRA with a benign LoRA to investigate attack methods that do not require backdoor fine-tuning. Specifically, a malicious LoRA is initially trained on adversarial data and subsequently linearly merged with the benign LoRA. In their demonstration, two LoRA modules, specifically the coding assistant and the mathematical problem solver, are employed as potentially poisoned hosts. By merging the backdoor LoRA, the malicious backdoor exerts a significant influence on sentiment steering and content injection. Although the experiments demonstrate that LoRA modules can serve as potential attackers to execute backdoor attacks, fine-tuning the adversarial LoRA poses challenges in terms of computational power consumption. Wang and Shu, (2023) propose a backdoor activation attack algorithm, named TA2, which does not require fine-tuning. This algorithm first generates steering vectors by calculating the differences in activations between the clean output and the output produced by a non-aligned LLM. TA2 determines the most effective intervention layer through comparative search and incorporates the steering vectors into the feedforward network. Finally, the steering vectors manipulate the responses of LLMs during the inference.

Chain-of-Thought To explore the security issues associated with chain-of-thought (CoT) prompting, Xiang et al., (2023) propose a backdoor attack algorithm called BadChain. This algorithm does not require access to the training dataset or model weights, achieving training-free backdoor attacks solely through CoT prompting, as shown in Table 4. BadChain exploits the inherent reasoning ability of CoT and LLMs by inserting backdoor reasoning steps into the sequence of reasoning steps, which manipulate the model’s final response. Specifically, the attacker inserts triggers into a subset of CoT demonstration examples and modifies the output of the examples. During the model inference, when the input does not contain the predefined triggers, the model performs normally. However, once the query contains the malicious triggers, that is, the backdoor reasoning steps, BadChain makes models behave in alignment with erroneous responses. The advantage of BadChain lies in its ability to eliminate the need for fine-tuning LLMs, consequently avoiding the consumption of computational resources.The advantage of BadChain lies in its ability to manipulate LLMs and achieve high attack success rates by solely exploiting the inherent reasoning properties of CoT. It eliminates the need for fine-tuning LLMs, consequently avoiding the consumption of computational resources and enabling more efficient deployment.

Table 4: Example of BadChain for backdoor attacks.

In-context Learning Zhao et al., 2024b design a training-free backdoor attack algorithm called ICLAttack, which explores the security vulnerabilities of LLMs based on in-context learning (ICL). ICLAttack includes two attack strategies: poisoning demonstration examples and poisoning demonstration prompts. In the poisoning demonstration examples strategy, assuming the attacker can access the entire model deployment process, as detailed in Table 5, malicious triggers are inserted into some demonstration examples, while the labels of the poisoned examples remain correctly annotated. During the model inference, when the input query contains the predefined trigger, ICLAttack exploits the inherent analogical reasoning properties of ICL to induce the model to behave in accordance with predefined intentions. Compared to poisoning demonstration examples, the poisoning demonstration prompts strategy is more stealthy. The attacker only needs to modify some prompts in the demonstration examples to establish an implicit relationship between special prompts and target labels, which results in the manipulation of the model’s output. Poisoning demonstration prompts does not require any modification to the input query, making it more covert.

Table 5: Backdoor attacks for in-context learning.

Wang et al., 2023a conduct a comprehensive exploration of the security issues in GPT-3.5 and GPT-4.0 (Achiam et al.,, 2023). Regarding backdoor attacks, they study whether LLMs can be misled by backdoored demonstrations through three distinct experimental settings, as shown in Table 6. In the first setting, they randomly select 16 demonstrations and implant backdoor attack triggers in 8 of them, modifying the labels to the target class. The second setting involves randomly selecting 16 demonstrations from a specific category and implanting backdoor attack triggers in 8 of them, while modifying the labels to the target class. Finally, in the third setting, they randomly select 16 demonstrations and implant backdoor attack triggers in all of them, modifying the labels to the target class. Moreover, they poison the instructions to further induce incorrect model decisions. This study demonstrates the potential security risks of LLMs, which can be cleverly backdoored to control the model’s output without the need for fine-tuning.

Table 6: Special instruction for backdoor attacks.

Zhang et al., (2024) introduce an instruction-based backdoor attack method to explore the security of LLMs. They implant backdoors in LLMs solely through designing prompts with embedded backdoor instructions. By utilizing only malicious instructions and corresponding triggers, without the need for any fine-tuning or modification of the LLM parameters, attackers can successfully manipulate the language model. In this study, triggers of various types, including word-level, syntax-level, and semantic-level, are validated, highlighting the potential vulnerabilities of LLMs.

4 Applications of Backdoor Attacks

Although backdoor attacks compromise the security of language models, they are a double-edged sword. Researchers apply them for data protection and model copyright protection. Li et al., 2020b innovatively repurpose backdoor attack methodologies as means of data protection. In their study, a small number of poisoned samples are implanted into the dataset to monitor and verify the usage of the data. This paradigm can effectively track whether the dataset is used by unauthorized third parties for model training, not only providing a protection method for the original dataset but also introducing new approaches to intellectual property protection. To safeguard open-source large language models against malicious usage that violates licenses, Li et al., 2023b embed watermarks into LLMs. These watermarks remain effective only in full-precision models while remaining hidden in quantized models. Consequently, users can only perform inference when utilizing large language models without further supervised fine-tuning of the model. Peng et al., (2023) propose EmbMarker, an embedding watermark method that protects LLMs from malicious copying by implanting backdoors on embeddings. This method constructs a set of triggers by selecting medium-frequency words from the text corpus, then selects a target embedding as the watermark and inserts it into the embeddings of texts containing trigger words. This watermark backdoor strategy effectively verifies malicious copying behavior while ensuring model performance.

5 Brief Discussion on Defending Against Backdoor Attacks

Although this paper primarily focuses on reviewing backdoor attacks under various fine-tuning methods, understanding existing defense strategies is equally crucial. Therefore, we will briefly discuss algorithms for defending against backdoor attacks from two perspectives: sample detection and model modification. By undertaking this discussion, we aspire to gain a deeper understanding of the nature of backdoor attacks.

Sample Detection In defending against backdoor attacks, defenders prevent the activation of backdoors in compromised models by identifying and filtering out poisoned samples or triggers (Kurita et al.,, 2020; Fan et al.,, 2021; Sun et al.,, 2023). Qi et al., 2021a propose the ONION algorithm, which detects whether the sample has been implanted with the trigger by calculating the impact of different tokens on the sample’s perplexity. The algorithm effectively counters backdoor attacks based on character-level triggers but struggles to defend against sentence-level and abstract grammatical triggers. Shao et al., (2021) observe the impact of removing words on the model’s prediction confidence, thereby identifying potential triggers. They prevent the activation of backdoors by deleting trigger words and reconstructing the original sample. Yang et al., 2021b calculate the difference in confidence between the original samples and the perturbed samples in the target label to detect poisoned samples. The algorithm significantly reduces computational complexity and saves substantial computational resources. Li et al., 2021c propose the BFClass algorithm, which pre-trains a trigger detector to identify potential sets of triggers. Simultaneously, it utilizes the category-based strategy to purge poisoned samples, preserving the model’s security. Li et al., 2021b combine mixup and shuffle strategies to defend against backdoor attacks, where mixup reconstructs the representation vectors and labels of samples to disrupt triggers, and shuffle alters the order of original samples to generate new ones, further enhancing defense capabilities. Jin et al., (2022) hypothesize that essential words should remain independent of triggers. They first utilize weakly supervised learning to train on reliable samples, and subsequently develop a binary classifier that discriminates between poisoned and reliable samples. Zhai et al., (2023) propose a noise-enhanced contrastive learning algorithm to improve model robustness. The algorithm initially generates noisy training data, and then mitigates the impact of backdoors on model predictions through contrastive learning. Wei et al., (2024) design a poisoned sample detector that identifies poisoned samples based on the prediction differences between the model and its variants.

Model Modification Unlike sample detection, model modification aims to alter the weights of the victim model to eliminate backdoors while ensuring model performance (Azizi et al.,, 2021; Shen et al.,, 2022; Liu et al., 2023b, ). Li et al., 2020a employ knowledge distillation to mitigate the impact of backdoor attacks on the victim model. In this method, the victim model is treated as the student model, while a model fine-tuned on the target task serves as the teacher model. This approach uses the teacher model to correct the behavior of the student model and defend against backdoor attacks. Liu et al., (2018) believe that in the victim model, the neurons activated by poisoned samples are significantly different from those activated by clean samples. Therefore, they prune specific neurons and then fine-tune the model, effectively blocking the activation path of the backdoor. Zhang et al., (2022) mix the weights of the victim model and a clean pre-trained language model, and then fine-tune the mixed model on clean samples. They also use the E-PUR algorithm to optimize the difference between the fine-tuned model and the victim model, which assists in eliminating the backdoor. Shen et al., (2022) defend against backdoor attacks by adjusting the temperature coefficient in the softmax function, which alters the training loss during the model optimization process. Lyu et al., (2022) analyze the attention shift phenomenon in the victim model to verify the model’s abnormal behavior and identify the poisoned model by observing changes in attention triggered by the backdoor. Zhao et al., 2024a fine-tune the victim model using the PEFT algorithm and randomly reset sample labels, consequently identifying poisoned samples based on the confidence of the model outputs.

6 Discussion and Open Challenges

Many backdoor attacks targeting foundational and large language models have been proposed so far, which are described in detail. However, new challenges pertaining to backdoor attacks are arising incessantly. Therefore, there are still some open issues that deserve to be thoroughly discussed and studied. To this end, we provide detailed suggestions for future research directions below.

6.1 Trigger Design

Existing backdoor attacks demonstrate promising results on victim models. However, the deployment of backdoor attacks often requires embedding triggers in samples, which may compromise the fluency of those samples. Importantly, samples containing triggers have the potential to alter the original semantics of the instances. Additionally, the insertion of triggers considerably increases the risk of the backdoor being detected by defense algorithms. Hence, the design of more covert and universal triggers still needs to be considered.

6.2 Clean-label towards Other Tasks

Clean-label backdoor attack algorithms, though effective in enhancing the stealth of backdoor attacks, are only applicable to tasks with limited sample label space. For instance, in sentiment analysis, attackers modify only a subset of training samples with the target label. By training, they establish an association between the trigger and the target output, avoiding modifications to the sample labels and achieving a clean-label backdoor attack. This allows the attacker to manipulate the model’s output in a controlled manner without the need for corrupting the sample’s labels, helping to maintain the integrity of the data and the stealthiness of the attack.

However, when facing generative tasks, where the outputs are not simple labels but sequences of text or complex data structures, the clean-label approach to backdoor attacks falls short. Existing backdoor attacks on generative tasks necessitate malicious modification of sample labels, which reduces the stealthiness of the attacks. Therefore, in the face of tasks with complex and varied sample labels, such as mathematical reasoning and question-answering, designing more covert backdoor attack algorithms poses a significant challenge.

6.3 Attack without Fine-tuning

A pivotal step in traditional backdoor attack algorithms involves embedding backdoors into the language model’s weights through parameter updates. Although these methods can successfully implement attacks, they typically require fine-tuning or training of the language model to develop a victim model. However, as language models grow in complexity with an increasing number of parameters, fine-tuning demands substantial computational resources. From the perspective of practical application, this requirement for increased computational capacity significantly complicates the deployment of backdoor attacks. Therefore, exploring backdoor attack algorithms that do not require language model fine-tuning in different learning strategies is imperative. By inducing model decision-making errors through sample modification alone, it is possible to improve the deployment efficiency of attacks and significantly lower their complexity.

6.4 General and Effective Defenses

Defending against backdoor attacks is crucial for safeguarding the application of large language models. Although existing defense algorithms can achieve the expected outcomes, their generality remains limited. For instance, the ONION (Qi et al., 2021a, ) algorithm can effectively defend against character-level trigger backdoor attacks but fails to counter sentence-level trigger backdoor attacks (Chen et al., 2021b, ). Furthermore, current defense algorithms rely on additional training steps or multiple iterations of search to identify and mitigate backdoor threats. This not only has the potential to consume substantial computational resources but also necessitates further enhancements in efficiency. Consequently, given the intricacy and diversity of backdoor attacks, the development of versatile and high-performance defense algorithms represents a crucial research imperative.

6.5 Backdoor Evaluation

At present, language models are in a passive defensive stance when confronted with backdoor attacks, lacking efficacious methodologies to determine whether they have been compromised by the implantation of backdoors. For instance, Zhao et al., 2024a propose a new defense algorithm based on the assumption that the model had been compromised through weight poisoning. Although previous research has demonstrated good defensive outcomes, these are predicated on the assumption that the language model has been compromised. Indiscriminate defense not only consumes resources but also has the potential to impair the performance of unaffected models. Considering the insufficiency of current evaluation methods, designing a lightweight yet effective assessment method is a problem worthy of investigation.

6.6 Others

Interpretation Analysis It is noteworthy that due to the inherent black-box nature of neural networks, backdoor attacks are challenging to interpret. Investigating the interpretability of backdoor attacks is crucial for devising more efficient defense algorithms. Comprehending the mechanisms behind backdoor attacks can better expose their internal characteristics, providing essential insights for the development of defense strategies.

Evaluation Metrics In settings with a limited sample label space, the attack success rate is commonly used as an evaluation metric. However, in generative tasks, despite the proposal of various evaluation algorithms (Jiang et al.,, 2023), a unified standard of assessment is still lacking. Furthermore, evaluating the stealthiness of backdoor attacks is also a worthy topic of discussion.

Uniform Benchmark The establishment of uniform benchmarks is crucial for assessing the effectiveness of backdoor attacks and defense algorithms, necessitating standardized poisoning ratios, datasets, baseline models, and evaluation metrics.

7 Conclusion

In this paper, we systematically review various backdoor attack methodologies based on fine-tuning techniques. Our research reveals that traditional backdoor attack algorithms, which utilize full-parameter fine-tuning, exhibit limitations as the parameters of large language models increase. These algorithms demand extensive computational resources, which substantially limit their applicability. In contrast, backdoor attack algorithms that employ parameter-efficient fine-tuning strategies considerably reduce computational resource requirements, thereby enhancing the operational efficiency of the attacks. Lastly, backdoor attacks that without fine-tuning allow for the execution of attacks that do not require updates to model parameters, markedly enhancing the flexibility of such attacks. In addition, we also discuss the potential challenges in backdoor attacks. These include investigating more covert methods of backdoor attacks suitable for generative tasks, devising triggers with universality, and advancing the study of backdoor attack algorithms that do not require parameter updates.

References

Achiam et al., (2023) Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Azizi et al., (2021) Azizi, A., Tahmid, I. A., Waheed, A., Mangaokar, N., Pu, J., Javed, M., Reddy, C. K., and Viswanath, B. (2021). T-miner: A generative approach to defend against trojan attacks on dnn-based text classification. In 30th USENIX Security Symposium (USENIX Security 21), pages 2255–2272.
Bagdasaryan and Shmatikov, (2022) Bagdasaryan, E. and Shmatikov, V. (2022). Spinning language models: Risks of propaganda-as-a-service and countermeasures. In 2022 IEEE Symposium on Security and Privacy (SP), pages 769–786. IEEE.
Blitzer et al., (2007) Blitzer, J., Dredze, M., and Pereira, F. (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th annual meeting of the association of computational linguistics, pages 440–447.
Bojar et al., (2016) Bojar, O., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huck, M., Yepes, A. J., Koehn, P., Logacheva, V., Monz, C., et al. (2016). Findings of the 2016 conference on machine translation (wmt16). In First conference on machine translation, pages 131–198. Association for Computational Linguistics.
Cao et al., (2023) Cao, Y., Cao, B., and Chen, J. (2023). Stealthy and persistent unalignment on large language models via backdoor injections. arXiv preprint arXiv:2312.00027.
Cettolo et al., (2016) Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., Cattoni, R., and Federico, M. (2016). The iwslt 2016 evaluation campaign. In Proceedings of the 13th International Conference on Spoken Language Translation.
Cettolo et al., (2014) Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., and Federico, M. (2014). Report on the 11th iwslt evaluation campaign. In Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign, pages 2–17.
(9) Chen, K., Meng, Y., Sun, X., Guo, S., Zhang, T., Li, J., and Fan, C. (2021a). Badpre: Task-agnostic backdoor attacks to pre-trained nlp foundation models. In International Conference on Learning Representations.
Chen et al., (2023) Chen, L., Cheng, M., and Huang, H. (2023). Backdoor learning on sequence to sequence models. arXiv preprint arXiv:2305.02424.
(11) Chen, X., Salem, A., Chen, D., Backes, M., Ma, S., Shen, Q., Wu, Z., and Zhang, Y. (2021b). Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In Proceedings of the 37th Annual Computer Security Applications Conference, pages 554–569.
Cheng et al., (2023) Cheng, P., Wu, Z., Du, W., and Liu, G. (2023). Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review. arXiv preprint arXiv:2309.06055.
Dai et al., (2019) Dai, J., Chen, C., and Li, Y. (2019). A backdoor attack against lstm-based text classification systems. IEEE Access, 7:138872–138878.
Danescu-Niculescu-Mizil and Lee, (2011) Danescu-Niculescu-Mizil, C. and Lee, L. (2011). Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. ACL HLT 2011, page 76.
De Gibert et al., (2018) De Gibert, O., Perez, N., Garcıa-Pablos, A., and Cuadros, M. (2018). Hate speech dataset from a white supremacy forum. EMNLP 2018, page 11.
Dong et al., (2024) Dong, T., Xue, M., Chen, G., Holland, R., Li, S., Meng, Y., Liu, Z., and Zhu, H. (2024). The philosopher’s stone: Trojaning plugins of large language models. arXiv preprint arXiv:2312.00374.
Dong et al., (2021) Dong, X., Luu, A. T., Lin, M., Yan, S., and Zhang, H. (2021). How should pre-trained language models be fine-tuned towards adversarial robustness? Advances in Neural Information Processing Systems, 34:4356–4369.
Du et al., (2023) Du, W., Li, P., Li, B., Zhao, H., and Liu, G. (2023). Uor: Universal backdoor attacks on pre-trained language models. arXiv preprint arXiv:2305.09574.
Du et al., (2022) Du, W., Zhao, Y., Li, B., Liu, G., and Wang, S. (2022). Ppt: Backdoor attacks on pre-trained models via poisoned prompt tuning. In IJCAI, pages 680–686.
Fan et al., (2021) Fan, M., Si, Z., Xie, X., Liu, Y., and Liu, T. (2021). Text backdoor detection using an interpretable rnn abstract model. IEEE Transactions on Information Forensics and Security, 16:4117–4132.
Formento et al., (2023) Formento, B., Foo, C. S., Tuan, L. A., and Ng, S. K. (2023). Using punctuation as an adversarial attack on deep learning-based nlp systems: An empirical study. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1–34.
Gan et al., (2022) Gan, L., Li, J., Zhang, T., Li, X., Meng, Y., Wu, F., Yang, Y., Guo, S., and Fan, C. (2022). Triggerless backdoor attack for nlp tasks with clean labels. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2942–2952.
Garcia et al., (2023) Garcia, X., Bansal, Y., Cherry, C., Foster, G., Krikun, M., Johnson, M., and Firat, O. (2023). The unreasonable effectiveness of few-shot learning for machine translation. In International Conference on Machine Learning, pages 10867–10878. PMLR.
Garg et al., (2020) Garg, S., Kumar, A., Goel, V., and Liang, Y. (2020). Can adversarial weight perturbations inject neural backdoors. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 2029–2032.
Grusky et al., (2018) Grusky, M., Naaman, M., and Artzi, Y. (2018). Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719.
Gu et al., (2023) Gu, N., Fu, P., Liu, X., Liu, Z., Lin, Z., and Wang, W. (2023). A gradient control method for backdoor attacks on parameter-efficient tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3508–3520.
Gu et al., (2024) Gu, N., Fu, P., Liu, X., Shen, B., Lin, Z., and Wang, W. (2024). Light-peft: Lightening parameter-efficient fine-tuning via early pruning. arXiv e-prints, pages arXiv–2406.
(28) Guo, Z., Li, W., Qian, Y., Arandjelovic, O., and Fang, L. (2024a). A white-box false positive adversarial attack method on contrastive loss based offline handwritten signature verification models. In International Conference on Artificial Intelligence and Statistics, pages 901–909. PMLR.
(29) Guo, Z., Wang, K., Li, W., Qian, Y., Arandjelović, O., and Fang, L. (2024b). Artwork protection against neural style transfer using locally adaptive adversarial color attack. arXiv preprint arXiv:2401.09673.
Gupta and Krishna, (2023) Gupta, A. and Krishna, A. (2023). Adversarial clean label backdoor attacks and defenses on text classification systems. In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), pages 1–12.
Hermann et al., (2015) Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. (2015). Teaching machines to read and comprehend. Advances in neural information processing systems, 28.
Hu et al., (2021) Hu, E. J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. (2021). Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations.
(33) Huang, H., Zhao, Z., Backes, M., Shen, Y., and Zhang, Y. (2023a). Composite backdoor attacks against large language models. arXiv preprint arXiv:2310.07676.
(34) Huang, Y., Zhuo, T. Y., Xu, Q., Hu, H., Yuan, X., and Chen, C. (2023b). Training-free lexical backdoor attacks on language models. In Proceedings of the ACM Web Conference 2023, pages 2198–2208.
Jiang et al., (2023) Jiang, S., Kadhe, S. R., Zhou, Y., Cai, L., and Baracaldo, N. (2023). Forcing generative models to degenerate ones: The power of data poisoning attacks. arXiv preprint arXiv:2312.04748.
Jin et al., (2022) Jin, L., Wang, Z., and Shang, J. (2022). Wedef: Weakly supervised backdoor defense for text classification. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11614–11626.
Kandpal et al., (2023) Kandpal, N., Jagielski, M., Tramèr, F., and Carlini, N. (2023). Backdoor attacks for in-context learning with language models. In The Second Workshop on New Frontiers in Adversarial Machine Learning.
Kenton and Toutanova, (2019) Kenton, J. D. M.-W. C. and Toutanova, L. K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Kirkpatrick et al., (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.
Kurita et al., (2020) Kurita, K., Michel, P., and Neubig, G. (2020). Weight poisoning attacks on pretrained models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2793–2806.
Kwon and Lee, (2021) Kwon, H. and Lee, S. (2021). Textual backdoor attack for the text classification system. Security and Communication Networks, 2021:1–11.
Lan et al., (2019) Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2019). Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations.
Lester et al., (2021) Lester, B., Al-Rfou, R., and Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059.
(44) Li, J., Yang, Y., Wu, Z., Vydiswaran, V., and Xiao, C. (2023a). Chatgpt as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger. arXiv preprint arXiv:2304.14475.
(45) Li, L., Jiang, B., Wang, P., Ren, K., Yan, H., and Qiu, X. (2023b). Watermarking llms with weight quantization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3368–3378.
(46) Li, L., Song, D., Li, X., Zeng, J., Ma, R., and Qiu, X. (2021a). Backdoor attacks on pre-trained models by layerwise weight poisoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3023–3032.
(47) Li, S., Guo, H., Tang, X., Tang, R., Hou, L., Li, R., and Zhang, R. (2024a). Embedding compression in recommender systems: A survey. ACM Computing Surveys, 56(5):1–21.
(48) Li, S., Liu, H., Dong, T., Zhao, B. Z. H., Xue, M., Zhu, H., and Lu, J. (2021b). Hidden backdoors in human-centric language models. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 3123–3140.
(49) Li, Y., Li, T., Chen, K., Zhang, J., Liu, S., Wang, W., Zhang, T., and Liu, Y. (2024b). Badedit: Backdooring large language models by model editing. arXiv preprint arXiv:2403.13355.
(50) Li, Y., Lyu, X., Koren, N., Lyu, L., Li, B., and Ma, X. (2020a). Neural attention distillation: Erasing backdoor triggers from deep neural networks. In International Conference on Learning Representations.
(51) Li, Y., Zhang, Z., Bai, J., Wu, B., Jiang, Y., and Xia, S.-T. (2020b). Open-sourced dataset protection via backdoor watermarking. arXiv preprint arXiv:2010.05821.
(52) Li, Z., Mekala, D., Dong, C., and Shang, J. (2021c). Bfclass: A backdoor-free text classification framework. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 444–453.
Lin, (2004) Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
(54) Liu, C., Zhang, W., Chen, G., Wu, X., Luu, A. T., Chang, C. H., and Bing, L. (2023a). Zero-shot text classification via self-supervised tuning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1743–1761.
Liu et al., (2024) Liu, H., Liu, Z., Tang, R., Yuan, J., Zhong, S., Chuang, Y.-N., Li, L., Chen, R., and Hu, X. (2024). Lora-as-an-attack! piercing llm safety under the share-and-play scenario. arXiv preprint arXiv:2403.00108.
Liu et al., (2018) Liu, K., Dolan-Gavitt, B., and Garg, S. (2018). Fine-pruning: Defending against backdooring attacks on deep neural networks. In International symposium on research in attacks, intrusions, and defenses, pages 273–294. Springer.
(57) Liu, Q., Wang, F., Xiao, C., and Chen, M. (2023b). From shortcuts to triggers: Backdoor defense with denoised poe. arXiv preprint arXiv:2305.14910.
Liu et al., (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Long et al., (2024) Long, Q., Deng, Y., Gan, L., Wang, W., and Pan, S. J. (2024). Backdoor attacks on dense passage retrievers for disseminating misinformation. arXiv preprint arXiv:2402.13532.
Lou et al., (2022) Lou, Q., Liu, Y., and Feng, B. (2022). Trojtext: Test-time invisible textual trojan insertion. In The Eleventh International Conference on Learning Representations.
Luo et al., (2023) Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., and Zhang, Y. (2023). An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747.
Lyu et al., (2022) Lyu, W., Zheng, S., Ma, T., and Chen, C. (2022). A study of the attention abnormality in trojaned berts. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4727–4741.
Ma et al., (2016) Ma, H., Jia, M., Lin, X., and Zhuang, F. (2016). Tag correlation and user social relation based microblog recommendation. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 2424–2430. IEEE.
Maas et al., (2011) Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150.
Mackenzie et al., (2020) Mackenzie, J., Benham, R., Petri, M., Trippas, J. R., Culpepper, J. S., and Moffat, A. (2020). Cc-news-en: A large english news corpus. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 3077–3084.
McCloskey and Cohen, (1989) McCloskey, M. and Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of learning and motivation, 24:109–165.
Mengara et al., (2024) Mengara, O., Avila, A., and Falk, T. H. (2024). Backdoor attacks to deep neural networks: A survey of the literature, challenges, and future research directions. IEEE Access.
Minh and Luu, (2022) Minh, D. N. and Luu, A. T. (2022). Textual manifold-based defense against natural language adversarial examples. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6612–6625.
Naber et al., (2003) Naber, D. et al. (2003). A rule-based style and grammar checker. GRIN Verlag Munich, Germnay.
Narayan et al., (2018) Narayan, S., Cohen, S. B., and Lapata, M. (2018). Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807.
Nguyen et al., (2021) Nguyen, T., Luu, A. T., Lu, T., and Quan, T. (2021). Enriching and controlling global semantics for text summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9443–9456.
Nguyen et al., (2024) Nguyen, T. D., Nguyen, T., Le Nguyen, P., Pham, H. H., Doan, K. D., and Wong, K.-S. (2024). Backdoor attacks and defenses in federated learning: Survey, challenges and future research directions. Engineering Applications of Artificial Intelligence, 127:107166.
Nguyen and Luu, (2022) Nguyen, T. T. and Luu, A. T. (2022). Improving neural cross-lingual abstractive summarization via employing optimal transport distance for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11103–11111.
Niu et al., (2024) Niu, Z., Ren, H., Gao, X., Hua, G., and Jin, R. (2024). Jailbreaking attack against multimodal large language model. arXiv preprint arXiv:2402.02309.
Pan et al., (2022) Pan, X., Zhang, M., Sheng, B., Zhu, J., and Yang, M. (2022). Hidden trigger backdoor attack on $\{$ NLP $\}$ models via linguistic style manipulation. In 31st USENIX Security Symposium (USENIX Security 22), pages 3611–3628.
Papineni et al., (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
Peng et al., (2023) Peng, W., Yi, J., Wu, F., Wu, S., Zhu, B. B., Lyu, L., Jiao, B., Xu, T., Sun, G., and Xie, X. (2023). Are you copying my model? protecting the copyright of large language models for eaas via backdoor watermark. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7653–7668.
(78) Qi, F., Chen, Y., Li, M., Yao, Y., Liu, Z., and Sun, M. (2021a). Onion: A simple and effective defense against textual backdoor attacks. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9558–9566.
(79) Qi, F., Li, M., Chen, Y., Zhang, Z., Liu, Z., Wang, Y., and Sun, M. (2021b). Hidden killer: Invisible textual backdoor attacks with syntactic trigger. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 443–453.
Qi et al., (2023) Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., and Henderson, P. (2023). Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations.
Qiang et al., (2024) Qiang, Y., Zhou, X., Zade, S. Z., Roshani, M. A., Zytko, D., and Zhu, D. (2024). Learning to poison large language models during instruction tuning. arXiv preprint arXiv:2402.13459.
Radford et al., (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., Dean, J., and Ghemawat, S. (2019). Language models are unsupervised multitask learners. In OSDI’04: Sixth Symposium on Operating System Design and Implementation, pages 137–150.
Rajpurkar et al., (2016) Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
Reimers and Gurevych, (2019) Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
Robey et al., (2023) Robey, A., Wong, E., Hassani, H., and Pappas, G. (2023). Smoothllm: Defending large language models against jailbreaking attacks. In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models.
Sang and De Meulder, (2003) Sang, E. T. K. and De Meulder, F. (2003). Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
Shao et al., (2021) Shao, K., Yang, J., Ai, Y., Liu, H., and Zhang, Y. (2021). Bddr: An effective defense against textual backdoor attacks. Computers & Security, 110:102433.
Shao et al., (2022) Shao, K., Zhang, Y., Yang, J., Li, X., and Liu, H. (2022). The triggers that open the nlp model backdoors are hidden in the adversarial samples. Computers & Security, 118:102730.
Shen et al., (2022) Shen, G., Liu, Y., Tao, G., Xu, Q., Zhang, Z., An, S., Ma, S., and Zhang, X. (2022). Constrained optimization with dynamic bound-scaling for effective nlp backdoor defense. In International Conference on Machine Learning, pages 19879–19892. PMLR.
Shen et al., (2021) Shen, L., Ji, S., Zhang, X., Li, J., Chen, J., Shi, J., Fang, C., Yin, J., and Wang, T. (2021). Backdoor pre-trained models can transfer to all. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 3141–3158.
Shi et al., (2023) Shi, J., Liu, Y., Zhou, P., and Sun, L. (2023). Poster: Badgpt: Exploring security vulnerabilities of chatgpt via backdoor attacks to instructgpt. In NDSS.
Snell et al., (2017) Snell, J., Swersky, K., and Zemel, R. (2017). Prototypical networks for few-shot learning. Advances in neural information processing systems, 30.
Socher et al., (2013) Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
Sun et al., (2023) Sun, X., Li, X., Meng, Y., Ao, X., Lyu, L., Li, J., and Zhang, T. (2023). Defending against backdoor attacks in natural language generation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5257–5265.
Tan et al., (2023) Tan, Z., Chen, Q., Huang, Y., and Liang, C. (2023). Target: Template-transferable backdoor attack against prompt-based nlp models via gpt4. arXiv preprint arXiv:2311.17429.
(96) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023a). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
(97) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023b). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wallace et al., (2021) Wallace, E., Zhao, T., Feng, S., and Singh, S. (2021). Concealed data poisoning attacks on nlp models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 139–150.
Wan et al., (2023) Wan, A., Wallace, E., Shen, S., and Klein, D. (2023). Poisoning language models during instruction tuning. In International Conference on Machine Learning, pages 35413–35425. PMLR.
Wang et al., (2018) Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. (2018). Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.
(101) Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu, C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., Arora, S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., Koyejo, S., Song, D., and Li, B. (2023a). Decodingtrust: A comprehensive assessment of trustworthiness in GPT models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Wang and Shu, (2023) Wang, H. and Shu, K. (2023). Backdoor activation attack: Attack large language models using activation steering for safety-alignment. arXiv preprint arXiv:2311.09433.
(103) Wang, J., Wu, J., Chen, M., Vorobeychik, Y., and Xiao, C. (2023b). On the exploitability of reinforcement learning with human feedback for large language models. arXiv preprint arXiv:2311.09641.
Wang et al., (2024) Wang, Y., Liu, Q., and Jin, C. (2024). Is rlhf more difficult than standard rl? a theoretical perspective. Advances in Neural Information Processing Systems, 36.
Wang et al., (2020) Wang, Y., Yao, Q., Kwok, J. T., and Ni, L. M. (2020). Generalizing from a few examples: A survey on few-shot learning. ACM computing surveys (csur), 53(3):1–34.
Wei et al., (2024) Wei, J., Fan, M., Jiao, W., Jin, W., and Liu, T. (2024). Bdmmt: Backdoor sample detection for language models through model mutation testing. IEEE Transactions on Information Forensics and Security.
Xian et al., (2018) Xian, Y., Lampert, C. H., Schiele, B., and Akata, Z. (2018). Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence, 41(9):2251–2265.
Xiang et al., (2023) Xiang, Z., Jiang, F., Xiong, Z., Ramasubramanian, B., Poovendran, R., and Li, B. (2023). Badchain: Backdoor chain-of-thought prompting for large language models. In The Twelfth International Conference on Learning Representations.
Xu et al., (2023) Xu, J., Ma, M. D., Wang, F., Xiao, C., and Chen, M. (2023). Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. arXiv preprint arXiv:2305.14710.
Xue et al., (2024) Xue, J., Zheng, M., Hua, T., Shen, Y., Liu, Y., Bölöni, L., and Lou, Q. (2024). Trojllm: A black-box trojan prompt attack on large language models. Advances in Neural Information Processing Systems, 36.
Yan et al., (2023) Yan, J., Yadav, V., Li, S., Chen, L., Tang, Z., Wang, H., Srinivasan, V., Ren, X., and Jin, H. (2023). Backdooring instruction-tuned large language models with virtual prompt injection. In NeurIPS 2023 Workshop on Backdoors in Deep Learning-The Good, the Bad, and the Ugly.
Yang et al., (2024) Yang, W., Bi, X., Lin, Y., Chen, S., Zhou, J., and Sun, X. (2024). Watch out for your agents! investigating backdoor threats to llm-based agents. arXiv preprint arXiv:2402.11208.
(113) Yang, W., Li, L., Zhang, Z., Ren, X., Sun, X., and He, B. (2021a). Be careful about poisoned word embeddings: Exploring the vulnerability of the embedding layers in nlp models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2048–2058.
(114) Yang, W., Lin, Y., Li, P., Zhou, J., and Sun, X. (2021b). Rap: Robustness-aware perturbations for defending against backdoor attacks on nlp models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8365–8381.
(115) Yang, W., Lin, Y., Li, P., Zhou, J., and Sun, X. (2021c). Rethinking stealthiness of backdoor attack against nlp models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5543–5557.
Yao et al., (2024) Yao, H., Lou, J., and Qin, Z. (2024). Poisonprompt: Backdoor attack on prompt-based large language models. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7745–7749. IEEE.
Yatskar, (2019) Yatskar, M. (2019). A qualitative comparison of coqa, squad 2.0 and quac. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2318–2323.
You et al., (2023) You, W., Hammoudeh, Z., and Lowd, D. (2023). Large language models are better adversaries: Exploring generative clean-label backdoor attacks against text classifiers. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12499–12527.
Zampieri et al., (2019) Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., and Kumar, R. (2019). Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1415–1420.
Zhai et al., (2023) Zhai, S., Shen, Q., Chen, X., Wang, W., Li, C., Fang, Y., and Wu, Z. (2023). Ncl: Textual backdoor defense using noise-augmented contrastive learning. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
Zhang et al., (2023) Zhang, B., Haddow, B., and Birch, A. (2023). Prompting large language model for machine translation: A case study. In International Conference on Machine Learning, pages 41092–41110. PMLR.
Zhang et al., (2024) Zhang, R., Li, H., Wen, R., Jiang, W., Zhang, Y., Backes, M., Shen, Y., and Zhang, Y. (2024). Rapid adoption, hidden risks: The dual impact of large language model customization. arXiv preprint arXiv:2402.09179.
Zhang et al., (2021) Zhang, X., Zhang, Z., Ji, S., and Wang, T. (2021). Trojaning language models for fun and profit. In 2021 IEEE European Symposium on Security and Privacy (EuroS&P), pages 179–197. IEEE.
Zhang et al., (2015) Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-level convolutional networks for text classification. Advances in neural information processing systems, 28.
Zhang et al., (2022) Zhang, Z., Lyu, L., Ma, X., Wang, C., and Sun, X. (2022). Fine-mixing: Mitigating backdoors in fine-tuned language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 355–372.
(126) Zhao, S., Gan, L., Tuan, L. A., Fu, J., Lyu, L., Jia, M., and Wen, J. (2024a). Defending against weight-poisoning backdoor attacks for parameter-efficient fine-tuning. arXiv preprint arXiv:2402.12168.
(127) Zhao, S., Jia, M., Tuan, L. A., Pan, F., and Wen, J. (2024b). Universal vulnerabilities in large language models: Backdoor attacks for in-context learning. arXiv preprint arXiv:2401.05949.
(128) Zhao, S., Li, Q., Yang, Y., Wen, J., and Luo, W. (2023a). From softmax to nucleusmax: A novel sparse language model for chinese radiology report summarization. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(6):1–21.
(129) Zhao, S., Luu, A. T., Fu, J., Wen, J., and Luo, W. (2024c). Exploring clean label backdoor attacks and defense in language models. In IEEE/ACM Transactions on Audio, Speech and Language Processing.
(130) Zhao, S., Wen, J., Luu, A., Zhao, J., and Fu, J. (2023b). Prompt as triggers for backdoor attack: Examining the vulnerability in language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12303–12317.
Zhao et al., (2022) Zhao, S., Zhang, T., Hu, M., Chang, W., and You, F. (2022). Ap-bert: enhanced pre-trained model through average pooling. Applied Intelligence, 52(14):15929–15937.
Zheng et al., (2024) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. (2024). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36.
Zhou et al., (2023) Zhou, X., Li, J., Zhang, T., Lyu, L., Yang, M., and He, J. (2023). Backdoor attacks with input-unique triggers in nlp. arXiv preprint arXiv:2303.14325.
Zou et al., (2024) Zou, W., Geng, R., Wang, B., and Jia, J. (2024). Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models. arXiv preprint arXiv:2402.07867.

Target Tasks	Benchmark Datasets	Evaluation Metrics	Representative Work
Text Classification	SST-2, IMDB, YELP, Amazon, OLID Hatespeech, AG’s news, QNLI and QQT	Clean Accuracy, ASR	(Gan et al.,, 2022; Yang et al., 2021c, )
Machine Translation	IWSLT 2014/2016 and WMT 2014/2016	BLEU, ASR	(Huang et al., 2023b, ; Wallace et al.,, 2021)
Summary Generation	XSum, CNN/Daily Mail and Newsroom	ROUGE, PPL, Target Match	(Bagdasaryan and Shmatikov,, 2022; Jiang et al.,, 2023)
Question Answering	SQuAD	EM, F1 score, ASR	(Zhang et al.,, 2021; Chen et al., 2021a, )
Named Entity Recognition	CoNLL 2003	Precision, Recall, F1 score, ASR	(Chen et al., 2021a, ; Huang et al., 2023b, )

Language Model	Learning Paradigm	Characteristics	Backdoor Triggers	Representative Work
Large Language Model	Fine-tuning	Style poison	Text style	(You et al.,, 2023)
	Fine-tuning	In-context Learning	Word	(Kandpal et al.,, 2023)
	Fine-tuning	Reinforcement Learning	Character, Sentence	(Shi et al.,, 2023; Wang et al., 2023b, )
	Fine-tuning	ChatGPT as tool	Sentence	(Li et al., 2023a, ; Tan et al.,, 2023)
	Fine-tuning	Weight poison	Character, Word	(Li et al., 2024b, )
	Fine-tuning	RAG poison	Grammatical	(Zou et al.,, 2024)
	Fine-tuning	Agents poison	Word, Sentence	(Yang et al.,, 2024)
	Hard prompts	Data poison	Sentence	(Yao et al.,, 2024)
	Prompt-tuning	Style poison	Text style, Grammatical	(Xue et al.,, 2024; Yao et al.,, 2024)
	P-Tuning	Weight poison	Character, Word, Sentence	(Zhao et al., 2024a, )
	LoRA	Generation	Sentence	(Dong et al.,, 2024)
	Instruction tuning	Task agnostic	Word, Sentence	(Xu et al.,, 2023; Wan et al.,, 2023)
	W/o Fine-tuning	LoRA	Sentence	(Liu et al.,, 2024)
	W/o Fine-tuning (CoT)	Chain-of-thought	Sentence	(Xiang et al.,, 2023)
	W/o Fine-tuning (ICL)	Clean label	Sentence	(Zhao et al., 2024b, )
	W/o Fine-tuning (ICL)	In-context learning	Character, Text style	(Zhang et al.,, 2024)
	W/o Fine-tuning (Instruction)	Instruction tuning	Sentence	(Wang et al., 2023a, ; Wang and Shu,, 2023)