Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Abstract

Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, which obscures their connections and makes comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, guided by this analysis, we introduce SPLIT, a new steering approach that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
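To make the "dynamic weight update" framing concrete, the following is a minimal sketch, not the paper's formulation. It checks two standard linear-algebra identities on a single toy linear layer: adding a steering vector to a layer's output equals a bias update, and a LoRA update to the weight equals an input-dependent shift of the output. The layer sizes, the direction `v`, the strength `alpha`, and the factors `A` and `B` are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 6, 2          # hypothetical toy dimensions

W = rng.normal(size=(d_out, d_in))   # frozen layer weight
b = rng.normal(size=d_out)           # frozen layer bias
x = rng.normal(size=d_in)            # one input activation


def linear(W, b, x):
    """Plain affine layer: W x + b."""
    return W @ x + b


# 1) Activation steering: add alpha * v to the layer output.
alpha = 0.7                          # hypothetical control strength
v = rng.normal(size=d_out)           # hypothetical concept direction
steered = linear(W, b, x) + alpha * v
# Equivalent view: a bias update b -> b + alpha * v.
as_bias_update = linear(W, b + alpha * v, x)

# 2) LoRA: low-rank weight update W -> W + B @ A.
A = rng.normal(size=(rank, d_in))
B = rng.normal(size=(d_out, rank))
lora_out = linear(W + B @ A, b, x)
# Equivalent view: an input-dependent shift B @ (A @ x) of the output.
as_output_shift = linear(W, b, x) + B @ (A @ x)

# Both pairs agree up to floating-point error.
steering_matches = np.allclose(steered, as_bias_update)
lora_matches = np.allclose(lora_out, as_output_shift)
print(steering_matches, lora_matches)
```

Under these assumptions, both interventions reduce to an additive change to the same layer's output, which is one way to read the claim that weight-space and activation-space methods can be compared within a single framework. How the paper defines the control signal and measures preference and utility is not shown here.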
