
Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

February 2, 2026
Authors: Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, Huajun Chen, Ningyu Zhang
cs.AI

Abstract

Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
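
As a rough illustration of the unified view described above, the sketch below checks numerically that, for a single linear layer, a LoRA-style low-rank adaptation and an additive activation intervention can both be rewritten as updates to the layer's weight and bias. This is a toy example under stated assumptions, not the paper's formulation or the SPLIT method: the layer shape, the helper names (`affine`, `close`), and the steering strength `alpha` are all hypothetical.

```python
import random

def matvec(W, x):
    """Matrix-vector product for list-of-lists W and list x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(u, v):
    """Elementwise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def matmul(A, B):
    """Matrix-matrix product for list-of-lists A and B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def mat_add(A, B):
    """Elementwise sum of two matrices."""
    return [add(ra, rb) for ra, rb in zip(A, B)]

def affine(W, b, x, dW=None, db=None):
    """Linear layer with an optional update: (W + dW) x + (b + db)."""
    if dW is not None:
        W = mat_add(W, dW)
    if db is not None:
        b = add(b, db)
    return add(matvec(W, x), b)

def close(u, v, tol=1e-9):
    """True if two vectors agree elementwise within tol."""
    return all(abs(a - b) <= tol for a, b in zip(u, v))

random.seed(0)
d_in, d_out, rank = 4, 3, 2  # hypothetical toy dimensions

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

W, b = rand_matrix(d_out, d_in), [0.0] * d_out
x = [random.uniform(-1, 1) for _ in range(d_in)]

# LoRA-style adaptation: the side branch B(Ax) equals a weight update dW = BA.
B, A = rand_matrix(d_out, rank), rand_matrix(rank, d_in)
lora_out = add(affine(W, b, x), matvec(B, matvec(A, x)))
lora_as_update = affine(W, b, x, dW=matmul(B, A))

# Activation intervention: adding alpha * v to the output equals a bias update.
alpha = 2.0  # assumed steering strength
v = [random.uniform(-1, 1) for _ in range(d_out)]
shift = [alpha * vi for vi in v]
steer_out = add(affine(W, b, x), shift)
steer_as_update = affine(W, b, x, db=shift)

print(close(lora_out, lora_as_update), close(steer_out, steer_as_update))
```

In this toy setting, increasing `alpha` moves the layer output further along `v`, which loosely mirrors the stronger-control regime in which the abstract reports preference rising and utility falling.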