Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Abstract

Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, which obscures their connections and makes comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective: control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, guided by this analysis, we introduce SPLIT, a new steering approach that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
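
To make the "dynamic weight update" framing concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a single linear layer h = Wx + b and shows that adding a steering vector to the activation is the same as updating the bias, while a LoRA-style adapter is a low-rank update to W. All names (`steer_activation`, `steer_as_bias_update`, `lora_update`, `pair_log_odds`) and all numeric values are hypothetical, and the two-logit log-odds is only an illustrative stand-in for the paper's polarity-paired measurement, whose exact definition the abstract does not give.

```python
# Toy illustration only: one linear layer, pure Python, hypothetical names.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def linear(W, b, x):
    """Compute h = W x + b."""
    return [h + bi for h, bi in zip(matvec(W, x), b)]

def steer_activation(W, b, x, v, alpha):
    """Activation steering: add alpha * v to the layer output."""
    return [h + alpha * vi for h, vi in zip(linear(W, b, x), v)]

def steer_as_bias_update(W, b, x, v, alpha):
    """The same intervention written as a bias update b' = b + alpha * v."""
    b_new = [bi + alpha * vi for bi, vi in zip(b, v)]
    return linear(W, b_new, x)

def lora_update(W, B, A, alpha):
    """LoRA-style low-rank update W' = W + alpha * (B A)."""
    rank = len(A)
    return [
        [W[i][j] + alpha * sum(B[i][k] * A[k][j] for k in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]

def pair_log_odds(logits):
    """Log-odds of option 0 over option 1 under a two-way softmax.

    For two logits this reduces to their difference (an assumed toy
    stand-in for a polarity-paired preference score).
    """
    return logits[0] - logits[1]

# Hypothetical toy values chosen so the arithmetic is exact in floating point.
W = [[1.0, 2.0], [0.5, -1.0]]
b = [0.0, 0.5]
x = [1.0, 1.0]
v = [1.0, -1.0]            # assumed "target concept" direction
B = [[1.0], [0.0]]         # rank-1 LoRA factors
A = [[0.0, 1.0]]

base = linear(W, b, x)
steered = steer_activation(W, b, x, v, 0.5)
as_bias = steer_as_bias_update(W, b, x, v, 0.5)
lora_out = linear(lora_update(W, B, A, 0.5), b, x)
preference_shift = pair_log_odds(steered) - pair_log_odds(base)
```

In this toy setting the steered output and the bias-updated output coincide, and the log-odds toward option 0 grows linearly with the control strength `alpha`. The sketch does not model utility or the valid-generation manifold, which would require a full model.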
