[1] "Reference-Free Backdoor Detection in LoRA Adapters via Cross-Layer Output Coherence"
Manuscript in preparation
We introduce the circuit score, a reference-free detector for backdoored LoRA adapters that separates backdoored from benign fine-tunes at AUROC 0.990 across two architectures, with no access to the clean base model or the trigger. The signal arises from a structural asymmetry: benign single-task training produces high cross-layer output coherence, whereas backdoor training, which must serve competing objectives, disrupts it.
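The paper's exact circuit-score definition is not reproduced in this summary. As a purely illustrative sketch, the following assumes cross-layer output coherence can be proxied by the mean pairwise cosine similarity between per-layer LoRA output contributions averaged over probe prompts; the function name and data layout are hypothetical, not the published metric.

```python
import torch

def circuit_score_proxy(layer_contribs: dict[str, torch.Tensor]) -> float:
    """layer_contribs maps a layer name to that layer's mean LoRA output
    contribution (a 1-D tensor of hidden size), averaged over probe prompts.

    Returns the mean off-diagonal cosine similarity across layers:
    high values suggest coherent (benign-like) adapters under this proxy,
    low values suggest disrupted (backdoor-like) adapters."""
    # Normalize each layer's contribution to a unit direction.
    dirs = torch.stack([v / (v.norm() + 1e-8) for v in layer_contribs.values()])
    sims = dirs @ dirs.T                      # pairwise cosine similarities
    n = sims.shape[0]
    # Average over off-diagonal entries only.
    return ((sims.sum() - sims.diagonal().sum()) / (n * (n - 1))).item()
```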
[2] "Geometric Detection of Trigger-Activated Sleeper Agents: A Weight-Space Control Protocol"
Apart Research AI Control Hackathon · March 2026
We develop a four-stage weight-space detection protocol for backdoored LoRA adapters using SVD-based geometric fingerprinting. The core metric, σ1/∥ΔW∥F, measures the rank-1 concentration of each layer's adapter weight delta. We find that camouflage training, designed specifically to evade geometric detection, amplifies the very fingerprint it attempts to suppress.
10/10 alarm rate · t=55.79 · p<0.000001 · concentration in layers 21–25 (o_proj, q_proj) across Gemma-3-1B and Qwen2.5-1.5B
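For concreteness, here is a minimal sketch of the σ1/∥ΔW∥F computation for a single LoRA layer, assuming the standard parameterization ΔW = (α/r)·B·A; adapter loading and per-layer iteration are omitted, and the tensor names are illustrative rather than the protocol's implementation.

```python
import torch

def rank1_concentration(A: torch.Tensor, B: torch.Tensor,
                        alpha: float, r: int) -> float:
    """A: (r, d_in), B: (d_out, r). Returns sigma_1 / ||ΔW||_F."""
    delta_w = (alpha / r) * (B @ A)           # effective weight delta for this layer
    svals = torch.linalg.svdvals(delta_w)     # singular values, descending
    return (svals[0] / torch.linalg.norm(delta_w, "fro")).item()
```

A delta whose energy lies entirely in one singular direction gives a ratio of 1.0, while energy spread evenly over k directions gives roughly 1/√k, which is what makes the ratio a measure of rank-1 concentration.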
The following questions remain empirically open.
At higher LoRA rank (e.g., r=64), where the low-rank masking effect is weaker, do spectral-magnitude detectors (σ1/∥ΔW∥F) recover their performance, or does the circuit score remain dominant regardless of rank?
Does the circuit score generalize to vision-language models, where gradient paths traverse both the vision encoder and the language model layers, and to multilingual triggers that exploit language structure?
What is the theoretical lower bound on geometric evasion: is there a trojaning strategy that achieves a high attack success rate (ASR) while remaining geometrically invisible to weight-space methods?
Can activation-steering-based adapter construction produce backdoors that evade circuit score detection entirely?
Does the dilemma curve flatten or steepen as λ increases beyond λ=5 in the adaptive attack regime?