C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal

Abstract

Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and its cost scales with the number of generations. Conditional variants improve selectivity by gating when steering is applied, but they still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ (Circuit-Restricted Weight Arithmetic), which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ_C supported only on that circuit (typically <5% of parameters). Applying Δθ_C yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
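To make step (ii) concrete, the following is a minimal sketch of what "supported only on that circuit" could look like in code. It is not the authors' implementation. The boolean `mask` stands in for the circuit that EAP-IG would identify, `delta` is an assumed dense update whose computation the abstract does not specify, and the parameter names, the `scale` argument, and all helper functions are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened weights


def circuit_restricted_update(delta: Params, mask: Dict[str, List[bool]]) -> Params:
    """Zero every entry of `delta` that lies outside the circuit mask."""
    return {
        name: [d if keep else 0.0 for d, keep in zip(values, mask[name])]
        for name, values in delta.items()
    }


def apply_update(theta: Params, delta_c: Params, scale: float = 1.0) -> Params:
    """Return edited weights theta + scale * delta_c as a new set of weights."""
    return {
        name: [w + scale * d for w, d in zip(values, delta_c[name])]
        for name, values in theta.items()
    }


def support_fraction(delta_c: Params) -> float:
    """Fraction of parameters that the restricted update actually changes."""
    total = sum(len(v) for v in delta_c.values())
    nonzero = sum(1 for v in delta_c.values() for d in v if d != 0.0)
    return nonzero / total


# Toy example with two hypothetical parameter tensors, flattened to lists.
# The mask here covers 25% of entries only to keep the example small; the
# abstract reports circuits that typically cover under 5% of parameters.
theta = {"attn.w": [0.5, -0.2, 0.1, 0.3], "mlp.w": [1.0, 0.0, -1.0, 0.25]}
delta = {"attn.w": [0.1, 0.1, 0.1, 0.1], "mlp.w": [-0.5, 0.2, 0.2, 0.2]}
mask = {
    "attn.w": [False, False, True, False],
    "mlp.w": [True, False, False, False],
}

delta_c = circuit_restricted_update(delta, mask)
edited = apply_update(theta, delta_c)
```

Because `edited` has the same keys and shapes as `theta`, it can be saved and served like any other checkpoint, which is the sense in which the update needs no inference-time hooks.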

