Steer2Edit: From Activation Steering to Component-Level Editing
February 10, 2026
Authors: Chung-En Sun, Ge Yan, Zimo Wang, Tsui-Wei Weng
cs.AI
Abstract
Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
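The contrast the abstract draws between uniform activation steering and selective, component-level rank-1 weight edits can be illustrated with a minimal NumPy sketch. This is not the authors' algorithm: the cosine-similarity scoring rule, the `top_k` selection, the `scale` parameter, and both function names are assumptions made purely for illustration, and only MLP neurons (not attention heads) are covered.

```python
import numpy as np


def activation_steering(h, v, alpha):
    """Inference-time steering: add the same direction v to a hidden state h."""
    return h + alpha * v


def rank1_neuron_edit(W_out, v, top_k, scale):
    """Hypothetical component-level edit of an MLP output projection.

    W_out has shape (d_model, d_ff); column j is the direction neuron j
    writes into the residual stream. Only the top_k neurons whose write
    directions align most strongly with v (by absolute cosine similarity)
    are modified. The total change to W_out is outer(v_unit, c) for a
    sparse coefficient vector c, which is a rank-1 update.
    """
    v_unit = v / np.linalg.norm(v)
    col_norms = np.linalg.norm(W_out, axis=0)
    # Cosine similarity between each neuron's write direction and v.
    cos = (v_unit @ W_out) / np.maximum(col_norms, 1e-12)
    # Assumed selection rule: keep the neurons with the largest |cos|.
    selected = np.argsort(-np.abs(cos))[:top_k]
    W_new = W_out.copy()
    for j in selected:
        # Assumed edit rule: shift the column along v in proportion to its alignment.
        W_new[:, j] += scale * cos[j] * v_unit
    return W_new, selected


# Toy usage with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
W_out = rng.normal(size=(d_model, d_ff))
v = rng.normal(size=d_model)
W_edited, edited_neurons = rank1_neuron_edit(W_out, v, top_k=3, scale=0.5)

# After editing, the forward pass is an ordinary matrix product with the
# new weights; no extra vector is injected at inference time.
activations = rng.normal(size=d_ff)
mlp_output = W_edited @ activations
```

In this toy setting the edit touches only three columns of `W_out`, whereas `activation_steering` shifts every hidden state it is applied to by the same vector. How the paper actually scores and edits components is not specified in the abstract.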