Steer2Edit: From Activation Steering to Component-Level Editing
February 10, 2026
Authors: Chung-En Sun, Ge Yan, Zimo Wang, Tsui-Wei Weng
cs.AI
Abstract
Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
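To make the idea of a component-level rank-1 weight edit concrete, here is a minimal NumPy sketch. The abstract gives neither the scoring rule nor the edit formula that Steer2Edit uses, so everything below is an illustrative assumption: the function name `rank1_edit`, the cosine-similarity score, the top-k selection, and the `strength` parameter are all hypothetical. The sketch treats each column of an MLP output projection as one neuron's write direction, picks the neurons most aligned with a steering vector, and rescales only their component along that vector.

```python
import numpy as np

def rank1_edit(W_out, steer, top_k=2, strength=0.5):
    """Illustrative rank-1 edit of an MLP output projection (not the paper's formula).

    W_out: (d_model, n_neurons) matrix; column j is what neuron j writes to the residual stream.
    steer: (d_model,) steering vector.
    """
    v = steer / np.linalg.norm(steer)                      # unit steering direction
    col_norms = np.linalg.norm(W_out, axis=0)
    # Assumed diagnostic score: cosine similarity between each neuron's write
    # direction and the steering direction.
    scores = (v @ W_out) / np.maximum(col_norms, 1e-12)
    # Assumed selection rule: keep the top_k neurons by absolute score.
    selected = np.argsort(-np.abs(scores))[:top_k]
    # Per-neuron coefficients; zero everywhere except the selected neurons.
    coeffs = np.zeros(W_out.shape[1])
    coeffs[selected] = strength * scores[selected] * col_norms[selected]
    # Outer product of one direction with one coefficient vector: a rank-1 update.
    delta = np.outer(v, coeffs)
    return W_out + delta, selected, delta

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))      # toy weights: d_model=8, 16 neurons
steer = rng.normal(size=8)        # toy steering vector
W_edited, selected, delta = rank1_edit(W, steer)
```

Under these assumptions, each selected neuron's component along the steering direction is scaled by `1 + strength` (a negative `strength` suppresses it), and all other neurons are left untouched. Because the change lives in the weights, the edited model runs an ordinary forward pass with no inference-time hook, which is the property the abstract highlights.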