Data poisoning attacks manipulate a model’s training data, compromising integrity, availability, or reliability. They are classified along seven dimensions: objective, goal, attacker knowledge, stealthiness, scope, impact, and variability.
In practice, the same dimensions apply across the stages:
- Pre-training — During the pre-training phase, LLMs are trained on large-scale unsupervised textual data. Attackers can manipulate web-scraped data by injecting malicious content into open-access sources such as Wikipedia, social media, and news platforms, thereby influencing the foundational knowledge acquired by the model.
- Fine-tuning — LLMs undergo fine-tuning to specialize in domain-specific tasks such as law, medicine, and finance. Attackers can inject poisoned samples into fine-tuning datasets to influence the model’s behavior on specific tasks.
- Preference Alignment — Preference alignment, typically implemented via Reinforcement Learning from Human Feedback (RLHF), fine-tunes LLMs to align with human values and expected behaviors. However, attackers can manipulate RLHF data to introduce unsafe or biased preferences.
- Instruction Tuning — Instruction tuning optimizes LLMs to better understand and execute user instructions. Attackers can introduce malicious instruction samples that modify the response patterns of the model to specific prompts.
- Prefix Tuning — Prefix tuning is a parameter-efficient fine-tuning (PEFT) method that optimizes a small set of task-specific parameters while keeping the pre-trained LLM frozen. This approach reduces computational costs and prevents catastrophic forgetting but introduces new security risks. Attackers can inject malicious prefix-tuning parameters or trigger-based poisoned prefixes to manipulate model outputs for generative tasks such as text summarization and completion.
- Prompt Tuning — Similar to prefix tuning, prompt tuning is a parameter-efficient fine-tuning method that optimizes model responses by training on task-specific prompts. Attackers can exploit this process by embedding biased or misleading prompts in the training set, influencing how the model responds to certain queries.
- In-Context Learning (ICL) — In-context learning (ICL) allows LLMs to adapt to new tasks during inference by conditioning on provided examples. Attackers can exploit this mechanism by injecting poisoned examples within a given input context, leading the model to generate incorrect, biased, or harmful outputs.
Press enter or click to view image in full size
Attack Objective
What part of the data is manipulated (labels, inputs, both, or synthetic samples)?
Data poisoning objectives are defined by what is manipulated: labels, inputs, or both (including fabricating new samples). Label attacks change labels without altering inputs (for example, PGA label manipulation and label only backdoors like FLIP). Input attacks make small changes to the inputs while keeping labels unchanged (for example, clean label feature collision and bilevel poisoning with minimal input edits). Data attacks alter both features and labels or introduce synthetic samples (for example, backdoor trigger injection and GAN generated poisons).
- Label Modification Attack: Change labels while keeping inputs unchanged. Includes label flipping, bilevel label manipulation (for example, PGA), and label only backdoors like FLIP. Effect: corrupted supervision shifts the decision boundary and degrades training.
Reference: Label poisoning is all you need
- Input Modification Attack (clean label): Perturb input features while preserving labels. Includes feature collision that aligns poisons with the target in feature space and bilevel poisoning that crafts small input changes. Goal: make the model learn wrong associations without any label edits.
Reference: Bullseye polytope: A scalable clean-label poisoning attack with improved transferability
- Data Modification Attack: Alter both features and labels or add fabricated samples. Most backdoor attacks using trigger patterns and GAN generated poisons belong here. Enables sophisticated, highly targeted behaviors while evading simple checks.
Reference: Lira: Learnable, imperceptible and robust backdoor attacks.
Attack Goal
What outcome is the attacker aiming for (untargeted availability drop, targeted misclassification, or a backdoor that triggers only in certain cases)?
Goals are untargeted for availability, targeted for integrity on chosen items or groups, and backdoor for integrity gated by a trigger. Untargeted attacks add corrupted or generated samples or use gradient alignment poisons to reduce overall accuracy. Targeted attacks force misclassification of a specific instance or a defined subpopulation while overall accuracy stays high. Backdoor attacks embed a trigger that flips predictions only when it appears. Clean accuracy stays high.
- Untargeted Attack: Broadly degrade performance or availability. Use corrupted or generated samples or gradient alignment poisoning to spread errors. Outcome: widespread misclassification with no specific target.
Reference: Preventing unauthorized use of proprietary data: Poisoning for secure dataset release.
- Targeted Attack: Misclassify a chosen instance or a defined subpopulation while overall accuracy stays high. Includes clean label feature space poisons and subpopulation attacks. The model looks fine to most users, masking the harm.
Reference: Subpopulation data poisoning attacks.
- Backdoor Attack: Embed a trigger that flips predictions only when present. Examples include BadNets, imperceptible noise triggers, and label consistent triggers. Clean accuracy remains high, which complicates detection.
Reference: Data Poisoning based Backdoor Attacks to Contrastive Learning.
Attack Knowledge
What access does the attacker have (white box, black box, gray box), and at which pipeline stage?
Knowledge settings are white box for full access, black box for input and output only, and gray box for partial access. White box access enables precise bilevel optimization to craft high influence poisons. Black box access relies on queries or transfer and can poison recommender systems via user item interactions. Gray box access leverages partial knowledge, such as a pretrained model, to poison fine tuning in shared or collaborative setups.
- White box Attack: Full knowledge of architecture, parameters, and training data. Enables precise bilevel optimization to place high influence poisons. Serves as a worst case benchmark.
Reference: Stronger data poisoning attacks break data sanitization defenses
- Black box Attack: No internals; the attacker relies on input and output behavior, queries, or surrogate transfer. Recommender systems can be poisoned via user item interactions without access to the algorithm. Effective yet difficult to detect given limited visibility.
Reference: Poisonrec: an adaptive data poisoning framework for attacking black-box recommender systems.
- Gray box Attack: Partial knowledge such as the architecture or a pretrained model, but not full data or weights. Poisons are aimed at fine tuning or specific stages. Common in shared or collaborative training settings.
Reference: Local model poisoning attacks to {Byzantine-Robust} federated learning.
Attack Stealthiness
How visible are the changes? Would simple checks, data filters, or distribution tests catch them?
Stealthiness divides attacks into non stealthy and stealthy. Non stealthy variants add visible anomalies or trojan triggers during retraining, prioritizing impact over concealment. Stealthy attacks keep the poisoned data distribution close to clean data (for example, StingRay, steganography or regularization triggers, Shadowcast for VLMs) to evade detectors.
- Non stealthy Attack: Noticeable anomalies or synthetic triggers with impact prioritized over concealment. A typical example is trojaning via a reverse engineered trigger and retraining. More likely to be caught by anomaly detection or robust training.
Reference: Reflection backdoor: A natural backdoor attack on deep neural networks.
- Stealthy Attack: Subtle, distribution preserving changes designed to evade detectors. Examples include StingRay, steganography or regularization based triggers, and Shadowcast for VLMs. These balance minimal perturbations with strong effect.
Reference Shadowcast: Stealthy data poisoning attacks against vision-language models.
Attack Scope
What is affected (a single instance, a single pattern/trigger, a single class, or a broad portion of the dataset)?
Scope ranges from single instance to broad scope. Single instance targets one example, often with clean label feature collision. Single pattern hits any input with a specific trigger. Single class degrades one class while others remain intact. Broad scope reduces performance across many classes or the whole dataset, often via large scale injection for availability.
- Single instance Attack: Target one specific sample. Often implemented via clean label feature collision. Overall metrics remain intact.
Reference: Poison frogs! targeted clean-label poisoning attacks on neural networks.
- Single pattern Attack: Affect any input containing a specific pattern or trigger. Backdoor attacks are the canonical case. High stealth when triggers are imperceptible.
Reference: Hidden Trigger Backdoor Attacks
- Single class Attack: Degrade one class while leaving others mostly intact. Bilevel poisons can push that class boundary to maximize confusion. Risky for sensitive domains such as biometric recognition or medical diagnosis.
Reference: Towards class-oriented poisoning attacks against neural networks.
- Broad scope Attack: Degrade multiple classes or the whole dataset. Achieved via large scale injection or strong distribution shifts. Aligns with availability goals.
Reference: Learning to confuse: Generating training time adversarial data with auto-encoder.
Attack Impact
What harm results (performance drop, weaker robustness, or unfairness across groups), and how will you measure it?
Impacts are performance, robustness, and fairness. Performance attacks cut overall accuracy, for example using bilevel optimization or reinforcement learning to craft poisons. Robustness attacks keep clean accuracy but reduce adversarial resilience, for example concealed bilevel poisoning or ARPS. Fairness attacks induce group bias, for example gradient based poisoning, adversarial sampling and labeling, or Un Fair Trojan.
- Performance Attack: Reduce overall accuracy or utility on clean data. Often uses bilevel optimization or reinforcement learning to craft poisons. Produces widespread misclassification.
Reference: Witches’ brew: Industrial scale data poisoning via gradient matching.
- Robustness Attack: Preserve clean accuracy but erode adversarial robustness. Includes concealed bilevel poisons and ARPS. Increases vulnerability to adversarial inputs at deployment.
- Fairness Attack: Induce demographic or group level bias. Techniques include gradient based poisoning, adversarial sampling and labeling, and Un Fair Trojan. Metrics such as demographic parity and equalized odds degrade without necessarily dropping accuracy.
Reference: Unfair Trojan: Targeted Backdoor Attacks Against Model Fairness.
Attack Variability
Is the poisoning fixed once or does it adapt over time or per input, and where would that adaptivity live in the pipeline?
Variability distinguishes static for fixed poisoning from dynamic for adaptive poisoning. Static uses unchanging poisons set before training, for example label flipping and fixed trigger backdoors like BadNets, and is more amenable to anomaly or robust training defenses. Dynamic adapts triggers or poisons over time or per input, for example BaN and c-BaN, input aware backdoors, and PoisonRec, improving evasion.
- Static Attacks: Fixed poisons set before training and unchanged afterward. Includes label flipping and fixed trigger backdoors like BadNets. More amenable to outlier filtering and robust training defenses.
Reference: Label-consistent backdoor attacks.
- Dynamic Attacks: Poisons or triggers adapt across inputs or over time. Includes BaN and c BaN dynamic backdoors, input aware backdoors, and adaptive PoisonRec for recommenders. Adaptivity improves evasion and complicates cleansing
Reference: Dynamic Backdoor Attacks Against Machine Learning Models.
Press enter or click to view image in full size
Reference
- Data Poisoning in Deep Learning: A Survey — https://arxiv.org/pdf/2503.22759v1 by Pinlong Zhao, Weiyao Zhu, Pengfei Jiao, Di Gao, Ou Wu
- ML02:2023 Data Poisoning Attack – https://owasp.org/www-project-machine-learning-security-top-10/docs/ML02_2023-Data_Poisoning_Attack