Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Abstract

Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, which obscures their connections and makes comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, guided by this analysis, we introduce SPLIT, a new steering approach that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
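As a rough illustration of the unified view, and not the paper's own formulation, consider a single linear layer. Adding a steering vector to the layer output is the same as updating the bias, and a LoRA adapter is the same as a low-rank update to the weight matrix, so both can be read as weight updates scaled by a control strength. The names `alpha`, `v`, `A`, `B`, and the helper `log_odds` are assumptions introduced for this sketch; in particular, `log_odds` is only one plausible way to score a polarity pair.

```python
import math
import random

random.seed(0)
d_in, d_out, rank = 4, 3, 2

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def add(u, w, scale=1.0):
    """Return u + scale * w for two vectors."""
    return [ui + scale * wi for ui, wi in zip(u, w)]

W, b, x = rand_mat(d_out, d_in), rand_vec(d_out), rand_vec(d_in)
alpha = 0.5                                          # control strength (hypothetical)
v = rand_vec(d_out)                                  # steering vector (hypothetical)
B, A = rand_mat(d_out, rank), rand_mat(rank, d_in)   # LoRA factors (hypothetical)

base = add(matvec(W, x), b)

# Activation steering: shift the layer output by alpha * v ...
steered = add(base, v, alpha)
# ... which equals a bias update b -> b + alpha * v.
as_bias_update = add(matvec(W, x), add(b, v, alpha))

# LoRA: add a scaled low-rank path B(Ax) next to the frozen weight ...
lora = add(base, matvec(B, matvec(A, x)), alpha)
# ... which equals a weight update W -> W + alpha * BA.
BA = matmul(B, A)
W_updated = [add(w_row, ba_row, alpha) for w_row, ba_row in zip(W, BA)]
as_weight_update = add(matvec(W_updated, x), b)

def log_odds(logp_pos, logp_neg):
    """Log-odds of the positive over the negative completion of a polarity pair."""
    return logp_pos - logp_neg

def close(u, w):
    """Element-wise comparison that tolerates floating-point rounding."""
    return all(math.isclose(a, c, abs_tol=1e-9) for a, c in zip(u, w))
```

Here `close(steered, as_bias_update)` and `close(lora, as_weight_update)` both hold, which is the sense in which the two interventions share one parameter-update form. Increasing `alpha` moves the output further along the control direction in either case.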
