Steer2Edit: From Activation Steering to Component-Level Editing

Abstract

Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
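To make the contrast between activation steering and a rank-1 weight edit concrete, the following is a minimal sketch, not the Steer2Edit update rule itself, which the abstract does not specify. It assumes a single component with a hypothetical output projection `W_out`, a hypothetical steering direction `v`, and a hypothetical scaling coefficient `lam`. The edit `W_out + lam * v (v^T W_out)` is one simple rank-1 form that rescales only the part of that component's output lying along `v`; the function name and all shapes are illustrative assumptions.

```python
import numpy as np

def rank1_scale_along(W_out, v, lam):
    """Illustrative rank-1 edit (an assumption, not the paper's rule).

    Returns W_out + lam * u (u^T W_out), where u is v normalized to unit
    length. Only the part of the component's output along u is rescaled,
    by a factor of (1 + lam); the orthogonal part is left untouched.
    """
    u = v / np.linalg.norm(v)
    return W_out + lam * np.outer(u, u @ W_out)

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
W_out = rng.normal(size=(d_model, d_head))  # hypothetical per-head output projection
v = rng.normal(size=d_model)                # hypothetical steering direction
h = rng.normal(size=d_head)                 # hypothetical head activation

# lam = -1 removes this component's write along the steering direction.
W_edit = rank1_scale_along(W_out, v, lam=-1.0)

u = v / np.linalg.norm(v)
out_before = W_out @ h
out_after = W_edit @ h
proj_before = u @ out_before   # component of the output along u before the edit
proj_after = u @ out_after     # component of the output along u after the edit

# The parts of the output orthogonal to u, before and after the edit.
orth_before = out_before - proj_before * u
orth_after = out_after - proj_after * u
```

Because the change is folded into the weights, the forward pass is an ordinary matrix multiply with `W_edit` and needs no inference-time hook. In this sketch the choice of `lam` per component is left open; how Steer2Edit derives per-head and per-neuron edits from a steering vector is described in the paper, not here.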
