[1] "Reference-Free Backdoor Detection in LoRA Adapters via Cross-Layer Output Coherence"
Manuscript in preparation
We introduce the circuit score, a reference-free detector for backdoored LoRA adapters that separates backdoored from benign fine-tunes at AUROC 0.990 across two architectures, with no access to the clean base model or the trigger. The signal arises from a structural asymmetry: benign single-task training produces high cross-layer output coherence, whereas backdoor training, which must serve competing objectives, disrupts it.
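The paper's exact circuit-score definition is not reproduced in this summary. As a purely illustrative sketch, the following assumes cross-layer output coherence can be proxied by the mean pairwise cosine similarity between per-layer LoRA output contributions averaged over probe prompts; the function name and data layout are hypothetical, not the published metric.

```python
import torch

def circuit_score_proxy(layer_contribs: dict[str, torch.Tensor]) -> float:
    """layer_contribs maps a layer name to that layer's mean LoRA output
    contribution (a 1-D tensor of hidden size), averaged over probe prompts.

    Returns the mean off-diagonal cosine similarity across layers:
    high values suggest coherent (benign-like) adapters under this proxy,
    low values suggest disrupted (backdoor-like) adapters."""
    # Normalize each layer's contribution to a unit direction.
    dirs = torch.stack([v / (v.norm() + 1e-8) for v in layer_contribs.values()])
    sims = dirs @ dirs.T                      # pairwise cosine similarities
    n = sims.shape[0]
    # Average over off-diagonal entries only.
    return ((sims.sum() - sims.diagonal().sum()) / (n * (n - 1))).item()
```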
[2] "Geometric Detection of Trigger-Activated Sleeper Agents: A Weight-Space Control Protocol"
Apart Research AI Control Hackathon · March 2026
We develop a four-stage weight-space detection protocol for backdoored LoRA adapters using SVD-based geometric fingerprinting. The core metric, σ1/∥ΔW∥F, measures the rank-1 concentration of each layer's adapter weight delta. We find that camouflage training, designed specifically to evade geometric detection, amplifies the very fingerprint it attempts to suppress.
10/10 alarm rate · t=55.79 · p<0.000001 · concentration in layers 21–25 (o_proj, q_proj) across Gemma-3-1B and Qwen2.5-1.5B
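For concreteness, here is a minimal sketch of the σ1/∥ΔW∥F computation for a single LoRA layer, assuming the standard parameterization ΔW = (α/r)·B·A; adapter loading and per-layer iteration are omitted, and the tensor names are illustrative rather than the protocol's implementation.

```python
import torch

def rank1_concentration(A: torch.Tensor, B: torch.Tensor,
                        alpha: float, r: int) -> float:
    """A: (r, d_in), B: (d_out, r). Returns sigma_1 / ||ΔW||_F."""
    delta_w = (alpha / r) * (B @ A)           # effective weight delta for this layer
    svals = torch.linalg.svdvals(delta_w)     # singular values, descending
    return (svals[0] / torch.linalg.norm(delta_w, "fro")).item()
```

A delta whose energy lies entirely in one singular direction gives a ratio of 1.0, while energy spread evenly over k directions gives roughly 1/√k, which is what makes the ratio a measure of rank-1 concentration.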
The following questions remain empirically open.
At higher LoRA rank (e.g., r=64), where the low-rank masking effect is weaker, do spectral-magnitude detectors (σ1/∥ΔW∥F) recover their performance, or does the circuit score remain dominant regardless of rank?
Does the circuit score generalize to vision-language models, where gradient paths traverse both the vision encoder and the language model layers, and to multilingual triggers that exploit language structure?
What is the theoretical lower bound on geometric evasion: is there a trojaning strategy that achieves a high attack success rate (ASR) while remaining geometrically invisible to weight-space methods?
Can activation-steering-based adapter construction produce backdoors that evade circuit score detection entirely?
Does the dilemma curve flatten or steepen as λ increases beyond λ=5 in the adaptive attack regime?