CornerCore AI

We study backdoor attacks and defenses in fine-tuned AI models. Our headline result is a reference-free detector that separates backdoored from benign LoRA adapters at AUROC 0.990 — without access to the clean base model, the training data, or the trigger. All datasets, adapters, detection code, and negative results are published openly.


The Attacker's Dilemma — empirical

Camouflage training designed to evade attention-based detectors concentrates weight updates into a tighter low-rank subspace, amplifying the very σ₁/‖ΔW‖_F signal it attempts to hide. Validated across Gemma-3-1B and Qwen2.5-1.5B.
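The spectral signal can be sketched in a few lines. This is a minimal illustration of the statistic itself; the function name and the toy rank check are our own, not the project's code.

```python
import numpy as np

def spectral_ratio(A: np.ndarray, B: np.ndarray) -> float:
    """sigma_1 / ||Delta W||_F for a LoRA update Delta W = B @ A.

    The ratio approaches 1 as the update's energy concentrates in a
    single direction, i.e. the tight low-rank subspace that
    camouflage training inadvertently produces.
    """
    delta_w = B @ A
    sigma = np.linalg.svd(delta_w, compute_uv=False)
    return float(sigma[0] / np.linalg.norm(delta_w))

# Toy sanity check: a rank-1 update puts all its energy in sigma_1,
# so its ratio is 1 (up to float error); higher rank spreads energy out.
rng = np.random.default_rng(0)
rank1 = spectral_ratio(rng.normal(size=(1, 64)), rng.normal(size=(64, 1)))
rank8 = spectral_ratio(rng.normal(size=(8, 64)), rng.normal(size=(64, 8)))
print(rank1, rank8)
```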

The Attacker's Dilemma — structural

Our reference-free circuit score detector exploits a property that cannot be trained away. Benign fine-tuning on a single task produces high cross-layer output coherence — every example pulls every o_proj layer toward one direction.

A backdoor introduces two competing objectives that partially cancel, reducing coherence. To restore coherence the attacker must remove the backdoor itself. We confirm this empirically: a camouflage-trained stealthy adapter sits inside the backdoor distribution, not the benign one.
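The coherence property can be illustrated with a small sketch. This is a simplified reading (take each layer's dominant output direction from the SVD of its o_proj update, then average pairwise alignment), not the project's actual circuit score implementation.

```python
import numpy as np

def cross_layer_coherence(deltas: list[np.ndarray]) -> float:
    """Mean absolute pairwise cosine similarity between the top
    left-singular vectors of per-layer o_proj updates.

    Single-task fine-tuning pulls every layer toward one output
    direction (coherence near 1); a backdoor's competing objective
    partially cancels it, dragging coherence toward the random
    baseline. Restoring coherence means removing the backdoor.
    """
    dirs = np.stack([np.linalg.svd(dw, full_matrices=False)[0][:, 0]
                     for dw in deltas])       # (layers, d_model), unit rows
    sims = np.abs(dirs @ dirs.T)              # pairwise |cos|
    n = len(deltas)
    return float((sims.sum() - n) / (n * (n - 1)))  # mean off-diagonal

rng = np.random.default_rng(1)
shared = rng.normal(size=128)                 # one shared output direction
benign = [np.outer(shared, rng.normal(size=32)) for _ in range(8)]
incoherent = [rng.normal(size=(128, 32)) for _ in range(8)]
print(cross_layer_coherence(benign), cross_layer_coherence(incoherent))
```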

Reference-Free Detection

Existing weight-space backdoor detection methods require a known-safe base model to compare against, or a guess at the trigger. Both assumptions break in practice — the supply chain rarely surfaces a clean reference, and triggers are by definition unknown.

Our circuit score is computed from the suspect adapter alone. The baseline is √(π/2d), derived from random-vector statistics in dimension d. No reference model, no trigger probe, no behavioural test.
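A minimal decision sketch, assuming the score is compared directly against that baseline (we read the expression as √(π/(2d)); the margin and the function name are illustrative placeholders, not a calibrated cutoff):

```python
import numpy as np

def circuit_verdict(score: float, d: int, margin: float = 3.0) -> str:
    """Compare a suspect adapter's circuit score to the random-vector
    baseline sqrt(pi / (2 * d)) for hidden dimension d.

    Scores well above baseline look like coherent single-task
    fine-tuning; scores near baseline suggest competing objectives.
    The margin is an assumed choice for illustration only.
    """
    baseline = np.sqrt(np.pi / (2 * d))
    return "benign-like" if score > margin * baseline else "suspect"

print(circuit_verdict(0.85, d=2048))  # far above the ~0.028 baseline
print(circuit_verdict(0.03, d=2048))  # sitting near the baseline
```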

Detector                  AUROC    Reference model needed?
Circuit score             0.990    No
o_u1 (deep layers)        0.955    No
σ₁/‖ΔW‖_F (spectral)      0.645    Yes

Tested on 10 backdoored + 10 benign + 1 stealthy LoRA adapter across Gemma-3-1B and Qwen2.5-1.5B. p = 1×10⁻⁶.
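For a set of adapters this small, AUROC can be computed exactly from score ranks via the Mann-Whitney U statistic. The scores below are synthetic placeholders for illustration, not the published measurements.

```python
import numpy as np

def auroc(benign_scores, backdoor_scores) -> float:
    """AUROC as the probability that a randomly chosen benign adapter
    outscores a randomly chosen backdoored one (the circuit score is
    higher when cross-layer coherence is intact). Ties count half.
    """
    b = np.asarray(benign_scores, float)[:, None]
    a = np.asarray(backdoor_scores, float)[None, :]
    wins = (b > a).sum() + 0.5 * (b == a).sum()
    return float(wins / (b.size * a.size))

# Synthetic illustration: two well-separated score distributions.
rng = np.random.default_rng(2)
benign = rng.normal(0.8, 0.05, size=10)
backdoor = rng.normal(0.2, 0.05, size=10)
print(auroc(benign, backdoor))
```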