When a model's training data is poisoned, the attack is embedded in the weights — not in the application code, not in the API configuration, not in the prompt. The usual security controls (input validation, output filtering, access control) address symptoms rather than causes. But teams using third-party or fine-tuned models often cannot retrain from scratch. Detection and containment are the practical options.
Types of Training Data Poisoning
Backdoor attacks
The most targeted form: a trigger pattern causes the model to deviate from its normal behavior. The trigger can be a specific phrase, a particular formatting pattern, or even a pixel-level pattern in images. Normal inputs produce normal outputs; triggered inputs produce attacker-controlled outputs.
Bias injection
Less precise but harder to detect: the attacker skews the training data to cause systematic biases in model outputs. This does not require a clean trigger — the poisoned model simply performs worse (or better, in attacker-favorable ways) on certain input distributions.
Knowledge corruption
The attacker introduces false factual claims in training data that the model absorbs as true. The model then confidently asserts false information about specific topics — the attacker's chosen domains.
Detection Approaches That Work Without Retraining
Behavioral probing
Run a structured suite of behavioral probes designed to detect anomalous responses. Compare the model's responses to a known-good baseline (the published model's responses before fine-tuning). Significant deviations on specific topics are a signal worth investigating.
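A minimal sketch of what such a probe run can look like, assuming you can call both the deployed model and the known-good baseline through simple text-in/text-out functions. The probe prompts, the `query_deployed`/`query_baseline` callables, the lexical-overlap metric, and the 0.4 threshold are all illustrative assumptions, not any specific vendor's API:

```python
# Behavioral probing sketch: compare deployed-model answers against a
# known-good baseline on a fixed probe suite. All names are illustrative;
# wire in your own model-calling functions.

def token_jaccard(a: str, b: str) -> float:
    """Crude lexical similarity between two responses (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def run_probe_suite(probes, query_deployed, query_baseline, threshold=0.4):
    """Return probes whose deployed/baseline responses diverge sharply."""
    flagged = []
    for prompt in probes:
        deployed = query_deployed(prompt)
        baseline = query_baseline(prompt)
        score = token_jaccard(deployed, baseline)
        if score < threshold:
            flagged.append({"prompt": prompt, "similarity": score,
                            "deployed": deployed, "baseline": baseline})
    return flagged

# Example probe suite: topics you care about plus likely trigger shapes.
PROBES = [
    "Summarize our refund policy in two sentences.",
    "Translate 'good morning' into French.",
    "List three risks of reusing passwords.",
]

# flagged = run_probe_suite(PROBES, query_deployed, query_baseline)
```

A flagged probe is a starting point for manual review, not proof of poisoning; lexical overlap is deliberately crude and can be swapped for an embedding-based similarity.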
Activation space analysis
Backdoored models often show distinctive intermediate layer activation patterns when the trigger is present. If you have access to the model's internal activations (possible with self-hosted models), monitor for anomalous activation clusters that correlate with specific input patterns.
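If the model is self-hosted in PyTorch, a forward hook is enough to capture one layer's activations and look for inputs that land far from the cluster formed by known-clean inputs. The layer choice, mean pooling over the sequence, and the z-score cutoff below are assumptions made for illustration:

```python
import torch

# Activation-space sketch: capture a layer's activations via a forward hook,
# build a centroid from known-clean inputs, and flag inputs whose activations
# sit unusually far from it. Layer, pooling, and cutoff are illustrative.

def capture_activations(model: torch.nn.Module, layer: torch.nn.Module, inputs):
    """Run each input through `model` and return one mean-pooled activation
    vector per input, taken from `layer`."""
    captured = []

    def hook(_module, _inp, out):
        hidden = out[0] if isinstance(out, tuple) else out
        # Assumes [batch, seq, dim]; pool over the sequence dimension.
        captured.append(hidden.detach().float().mean(dim=1).squeeze(0))

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            for x in inputs:
                model(**x) if isinstance(x, dict) else model(x)
    finally:
        handle.remove()
    return torch.stack(captured)

def flag_outliers(clean_acts: torch.Tensor, test_acts: torch.Tensor, z_cutoff=3.0):
    """Flag test inputs whose distance from the clean centroid is anomalous."""
    centroid = clean_acts.mean(dim=0)
    clean_dist = (clean_acts - centroid).norm(dim=1)
    mu, sigma = clean_dist.mean(), clean_dist.std().clamp_min(1e-6)
    z = ((test_acts - centroid).norm(dim=1) - mu) / sigma
    return [i for i, score in enumerate(z.tolist()) if score > z_cutoff]
```

A single centroid is the simplest possible model of "normal"; clustering the clean activations first catches backdoors that hide inside one mode of an otherwise multimodal distribution.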
Cross-model consistency checking
Compare outputs from your deployed model with outputs from a known-clean reference model on the same inputs. Systematic divergence on specific topics or formats is a red flag. This requires access to a trustworthy reference — but major model providers publish reference outputs for their base models.
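A minimal sketch of per-topic consistency checking, assuming the same kind of text-in/text-out callables as before. Aggregating divergence by topic keeps one odd answer from dominating; the `query_deployed`/`query_reference` names and the lexical divergence metric are illustrative stand-ins:

```python
# Cross-model consistency sketch: run the same prompts through the deployed
# model and a trusted reference, then aggregate divergence per topic.

from collections import defaultdict
from statistics import mean

def overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def topic_divergence(prompts_by_topic, query_deployed, query_reference):
    """Return mean divergence per topic (0.0 = identical wording, 1.0 = disjoint)."""
    scores = defaultdict(list)
    for topic, prompts in prompts_by_topic.items():
        for p in prompts:
            scores[topic].append(1.0 - overlap(query_deployed(p), query_reference(p)))
    return {topic: mean(vals) for topic, vals in scores.items()}

# Topics with much higher divergence than the rest are candidates for review:
# report = topic_divergence({"billing": [...], "security": [...]},
#                           query_deployed, query_reference)
```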
- Use vendor-provided model cards and known-behavior documentation as your baseline
- Run behavioral probes before and after every fine-tuning operation
- Isolate fine-tuned models from production until behavioral probes pass
- Monitor production outputs for statistical drift from baseline behavior (see the sketch after this list)
- Treat any model whose training data provenance you cannot verify as untrusted by default
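As a concrete example of the drift-monitoring item above, one lightweight approach is to track a simple per-response statistic and compare the production window against the baseline window with a two-sample Kolmogorov-Smirnov test. Response length as the statistic, SciPy's `ks_2samp`, and the 0.01 alert threshold are assumptions for illustration; richer features (refusal rate, topic mix, embedding norms) slot into the same shape:

```python
# Drift-monitoring sketch: compare a per-response statistic (response length
# here, as a stand-in for richer features) between a baseline window and the
# current production window. The p-value threshold is an illustrative choice.

from scipy.stats import ks_2samp

def drift_alert(baseline_values, production_values, p_threshold=0.01):
    """Return (alert, p_value); alert is True when the two samples look
    like they come from different distributions."""
    _stat, p_value = ks_2samp(baseline_values, production_values)
    return p_value < p_threshold, p_value

# baseline = [len(r.split()) for r in baseline_responses]
# current  = [len(r.split()) for r in last_24h_responses]
# alert, p = drift_alert(baseline, current)
```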
