C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal

Abstract

Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and its cost scales with the number of generations; conditional variants improve selectivity by gating when steering is applied, but they still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ (Circuit-Restricted Weight Arithmetic), which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ_C supported only on that circuit (typically <5% of parameters). Applying Δθ_C yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
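The idea of a weight update supported only on a circuit can be illustrated with a minimal sketch. The abstract does not specify how the update is computed, so everything below is an assumption made for illustration: the unconstrained direction is taken to be the difference between a hypothetical refusal-tuned checkpoint and the base checkpoint, the circuit is given as a precomputed boolean mask (standing in for the output of EAP-IG, which is not implemented here), and the names `circuit_restricted_delta`, `apply_delta`, and `support_fraction` are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]   # parameter name -> flattened weights
Mask = Dict[str, List[bool]]      # parameter name -> True where the circuit lives


def circuit_restricted_delta(base: Params, tuned: Params, mask: Mask) -> Params:
    """Return mask * (tuned - base): the update is zero outside the circuit.

    Assumption: the raw update direction is (tuned - base); the source does
    not state how the update is actually obtained.
    """
    delta: Params = {}
    for name, weights in base.items():
        # Parameters missing from the mask are treated as outside the circuit.
        keep = mask.get(name, [False] * len(weights))
        delta[name] = [
            (t - b) if k else 0.0
            for b, t, k in zip(weights, tuned[name], keep)
        ]
    return delta


def apply_delta(base: Params, delta: Params, scale: float = 1.0) -> Params:
    """Produce an edited checkpoint base + scale * delta (no runtime hooks)."""
    return {
        name: [b + scale * d for b, d in zip(weights, delta[name])]
        for name, weights in base.items()
    }


def support_fraction(delta: Params) -> float:
    """Fraction of parameters that the update actually changes."""
    total = sum(len(v) for v in delta.values())
    nonzero = sum(1 for v in delta.values() for d in v if d != 0.0)
    return nonzero / total


# Toy checkpoints with two parameter tensors (values are made up).
base = {"attn.q": [0.0, 1.0, 2.0, 3.0], "mlp.out": [1.0, 1.0, 1.0, 1.0]}
tuned = {"attn.q": [0.5, 1.0, 2.5, 3.0], "mlp.out": [2.0, 2.0, 2.0, 2.0]}
# Hypothetical circuit: only the first entry of attn.q is refusal-causal.
mask = {"attn.q": [True, False, False, False]}

delta = circuit_restricted_delta(base, tuned, mask)
edited = apply_delta(base, delta)
```

In this toy case only one of eight parameters changes, and `edited` is an ordinary set of weights that could be saved and served like any other checkpoint, which is the sense in which the cost moves from per-request intervention to a one-time offline step.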
