C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal
February 4, 2026
Authors: Aditya Kasliwal, Pratinav Seth, Vinay Kumar Sankarapu
cs.AI
Abstract
Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks, and its cost grows with the number of generations; conditional variants improve selectivity by gating when steering is applied, but they still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ (Circuit-Restricted Weight Arithmetic), which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ_C supported only on that circuit (typically <5% of parameters). Applying Δθ_C yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
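
The idea of an update "supported only on that circuit" can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the parameter names, the toy shapes, the random dense update `delta`, and the hand-picked boolean `mask` are hypothetical. The abstract does not specify how the circuit mask is derived from EAP-IG or how the update itself is computed, so this sketch only shows the masking and application step that produces an edited checkpoint offline.

```python
import numpy as np

def restrict_update(delta, mask):
    """Zero every entry of the update that lies outside the circuit mask."""
    return {name: delta[name] * mask[name] for name in delta}

def apply_update(theta, delta_c, alpha=1.0):
    """Return a new checkpoint theta + alpha * delta_c (inputs are not modified)."""
    return {name: theta[name] + alpha * delta_c[name] for name in theta}

def support_fraction(mask):
    """Fraction of all parameters that belong to the circuit."""
    total = sum(m.size for m in mask.values())
    return sum(int(m.sum()) for m in mask.values()) / total

# Toy "checkpoint" with two hypothetical weight matrices.
rng = np.random.default_rng(0)
theta = {
    "attn.W_o": rng.normal(size=(8, 8)),
    "mlp.W_in": rng.normal(size=(8, 32)),
}

# Stand-in dense update; how the real update is obtained is not stated here.
delta = {name: 0.01 * rng.normal(size=w.shape) for name, w in theta.items()}

# Hypothetical circuit mask: a single row of one matrix (8 of 320 parameters).
mask = {name: np.zeros(w.shape, dtype=bool) for name, w in theta.items()}
mask["attn.W_o"][3, :] = True

delta_c = restrict_update(delta, mask)   # circuit-restricted update
edited = apply_update(theta, delta_c)    # drop-in edited checkpoint

print("circuit support fraction:", support_fraction(mask))
```

Because the edit is folded into the weights, the resulting `edited` dictionary can be saved and served like any other checkpoint; nothing in this sketch needs to run at inference time.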