We study backdoor attacks and defenses in fine-tuned AI models. Our headline result is a reference-free detector that separates backdoored from benign LoRA adapters at AUROC 0.990 — without access to the clean base model, the training data, or the trigger. All datasets, adapters, detection code, and negative results are published openly.
The Attacker's Dilemma — empirical
Camouflage training designed to evade attention-based detectors concentrates weight updates into a tighter low-rank subspace, amplifying the signal it attempts to hide. Validated across Gemma-3-1B and Qwen2.5-1.5B.
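The concentration effect can be made concrete with the spectral ratio used later in the comparison table, σ₁/‖ΔW‖_F: the fraction of a LoRA update's Frobenius norm captured by its top singular direction. The following is an illustrative sketch with synthetic matrices, not the paper's measurement pipeline; the dimensions and the `spectral_concentration` helper are assumptions for demonstration.

```python
import numpy as np

def spectral_concentration(B: np.ndarray, A: np.ndarray) -> float:
    """sigma_1 / ||delta_W||_F for a LoRA update delta_W = B @ A.
    Values near 1.0 mean the update is dominated by a single rank-1
    direction; lower values mean mass is spread across the subspace."""
    s = np.linalg.svd(B @ A, compute_uv=False)
    return float(s[0] / np.linalg.norm(s))  # ||delta_W||_F = sqrt(sum s_i^2)

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8

# A diffuse random rank-8 update...
diffuse = spectral_concentration(rng.normal(size=(d_out, r)),
                                 rng.normal(size=(r, d_in)))

# ...versus one whose energy is packed into a single shared direction,
# as camouflage training is observed to produce.
u = rng.normal(size=(d_out, 1))
v = rng.normal(size=(1, d_in))
concentrated = spectral_concentration(
    np.hstack([10 * u, rng.normal(size=(d_out, r - 1))]),
    np.vstack([10 * v, rng.normal(size=(r - 1, d_in))]))

print(round(diffuse, 2), round(concentrated, 2))
```

A detector only has to measure this ratio; the attacker, by optimising against attention-based detection, pushes the ratio up rather than down.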
The Attacker's Dilemma — structural
Our reference-free circuit score detector exploits a property that cannot be trained away. Benign fine-tuning on a single task produces high cross-layer output coherence — every example pulls every layer toward one direction.
A backdoor introduces two competing objectives that partially cancel, reducing coherence. To restore coherence the attacker must remove the backdoor itself. We confirm this empirically: a camouflage-trained stealthy adapter sits inside the backdoor distribution, not the benign one.
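The coherence argument can be sketched numerically: extract each layer's dominant output direction from its LoRA update and measure how aligned those directions are across layers. This is an illustrative toy, not the paper's exact circuit score; the layer count, dimensions, and the `circuit_coherence` helper are all assumptions.

```python
import numpy as np

def circuit_coherence(updates) -> float:
    """Mean pairwise |cosine| between the top output directions of
    per-layer LoRA updates (B, A).  Single-task fine-tuning tends to
    align these directions; a backdoor's second objective pulls some
    layers toward a different direction, lowering the score."""
    dirs = []
    for B, A in updates:
        u, _, _ = np.linalg.svd(B @ A)
        dirs.append(u[:, 0])                 # dominant output direction
    dirs = np.stack(dirs)
    sims = np.abs(dirs @ dirs.T)             # |cos| is sign-invariant
    n = len(dirs)
    return float((sims.sum() - n) / (n * (n - 1)))  # off-diagonal mean

rng = np.random.default_rng(1)
d, r, layers = 32, 4, 6

# Benign-like: every layer shares one dominant output direction.
shared = rng.normal(size=(d, 1))
coherent = [(np.hstack([5 * shared, 0.1 * rng.normal(size=(d, r - 1))]),
             rng.normal(size=(r, d))) for _ in range(layers)]
# Backdoor-like: layers pull in unrelated directions.
incoherent = [(rng.normal(size=(d, r)), rng.normal(size=(r, d)))
              for _ in range(layers)]

c_coh = circuit_coherence(coherent)
i_coh = circuit_coherence(incoherent)
print(c_coh > i_coh)
```

Restoring high coherence in this toy requires making every layer's update point the same way again, which is exactly what removing the second objective does.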
Reference-Free Detection
Existing weight-space backdoor detection methods require a known-safe base model to compare against, or a guess at the trigger. Both assumptions break in practice — the supply chain rarely surfaces a clean reference, and triggers are by definition unknown.
Our circuit score is computed from the suspect adapter alone. The detection baseline follows analytically from random-vector statistics in the relevant dimension. No reference model, no trigger probe, no behavioural test.
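That null baseline is cheap to reproduce: for independent random directions in R^d, the expected |cosine similarity| is approximately sqrt(2/(πd)). A minimal Monte-Carlo check follows; the choice d = 2048 is an assumed hidden size for illustration, since the source does not state the dimension.

```python
import numpy as np

def random_coherence_baseline(d: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte-Carlo estimate of E|cos(u, v)| for independent random
    Gaussian directions in R^d -- the null a coherence score can be
    compared against without any reference model."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(trials, d))
    v = rng.normal(size=(trials, d))
    cos = np.abs(np.einsum('ij,ij->i', u, v)
                 / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)))
    return float(cos.mean())

d = 2048  # assumed dimension for illustration only
est = random_coherence_baseline(d)
closed_form = np.sqrt(2 / (np.pi * d))
print(abs(est - closed_form) < 1e-3)
```

Because the baseline is a closed-form function of dimension, a score far above it is evidence of task-driven alignment with no clean model in the loop.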
| Detector | AUROC | Reference model needed? |
|---|---|---|
| Circuit score | 0.990 | No |
| o_u1 (deep layers) | 0.955 | No |
| σ₁/‖ΔW‖_F (spectral) | 0.645 | Yes |
Tested on 10 backdoored + 10 benign + 1 stealthy LoRA adapters across Gemma-3-1B and Qwen2.5-1.5B.
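For a 10 + 10 split like the one above, AUROC reduces to a simple rank statistic: the probability that a benign adapter scores higher (more coherent) than a backdoored one, with ties counted half. The scores below are hypothetical placeholders, not the paper's data.

```python
def auroc(high_class, low_class):
    """Rank-based AUROC: P(score from high_class > score from low_class),
    counting ties as 0.5.  Equivalent to the normalised Mann-Whitney U."""
    wins = sum((h > l) + 0.5 * (h == l) for h in high_class for l in low_class)
    return wins / (len(high_class) * len(low_class))

# Hypothetical circuit scores: benign adapters should score high
# (coherent), backdoored ones low.  One benign outlier overlaps.
benign     = [0.55, 0.61, 0.48, 0.29, 0.66, 0.58, 0.52, 0.63, 0.57, 0.60]
backdoored = [0.21, 0.25, 0.19, 0.30, 0.27, 0.22, 0.26, 0.24, 0.28, 0.35]

print(auroc(benign, backdoored))  # → 0.98
```

With only 100 pairwise comparisons available, each misordered pair costs 0.01 AUROC, so 0.990 on the real adapters corresponds to a single misranked pair out of a hundred.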