UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

Abstract

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, which makes them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations from natural-language conditions. Instead of fitting a separate intervention for each target behavior, UniSteer learns a universal conditional velocity field in activation space. At inference time, UniSteer performs flow inversion: it partially transports a source activation toward a latent state, regenerates it under a target textual condition, and injects the result back into the frozen LLM. The same conditional model also supports activation-space classification by selecting the textual label with the lowest reconstruction energy. Experiments on three target LLMs show that UniSteer provides a unified interface across behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.
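As an illustration only, the flow inversion and energy-based classification described above can be sketched on a toy problem. Everything in this sketch is an assumption rather than the paper's implementation: the learned velocity field is replaced by a closed-form field for 2-D Gaussians, the labels "formal" and "casual" and the names `velocity`, `steer`, `energy`, and `classify` are hypothetical, the inversion is assumed to run under a source condition, and "reconstruction energy" is assumed to be the flow matching regression error under a given label.

```python
import math
import random

# Hypothetical toy setup: each text condition is a Gaussian over 2-D "activations"
# with mean MEANS[cond] and shared standard deviation S. A real system would use a
# trained network conditioned on a text embedding instead of this closed form.
MEANS = {"formal": [2.0, 0.0], "casual": [-2.0, 0.0]}
S = 0.5


def velocity(x, t, cond):
    """Exact marginal velocity for linear paths x_t = (1 - t) * z + t * x1,
    with z ~ N(0, I) at t = 0 and x1 ~ N(MEANS[cond], S^2 I) at t = 1."""
    mu = MEANS[cond]
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    a = (t * S * S - (1.0 - t)) / var_t
    return [m + a * (xi - t * m) for xi, m in zip(x, mu)]


def integrate(x, t_start, t_end, cond, steps=200):
    """Midpoint-rule integration of dx/dt = velocity(x, t, cond).
    Works in either time direction; t_start == t_end returns x unchanged."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        k1 = velocity(x, t, cond)
        mid = [xi + 0.5 * h * k for xi, k in zip(x, k1)]
        k2 = velocity(mid, t + 0.5 * h, cond)
        x = [xi + h * k for xi, k in zip(x, k2)]
        t += h
    return x


def steer(x_src, src_cond, tgt_cond, t0):
    """Partial flow inversion: go from t = 1 back to t0 under the source
    condition (an assumption), then regenerate to t = 1 under the target."""
    latent = integrate(x_src, 1.0, t0, src_cond)
    return integrate(latent, t0, 1.0, tgt_cond)


def energy(x, cond, n=512, seed=0):
    """Assumed energy: mean squared flow matching error for x under cond.
    The fixed seed reuses the same (t, z) draws for every label."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        t = rng.uniform(0.05, 0.95)
        z = [rng.gauss(0.0, 1.0) for _ in x]
        x_t = [(1.0 - t) * zi + t * xi for zi, xi in zip(z, x)]
        v = velocity(x_t, t, cond)
        total += sum((vi - (xi - zi)) ** 2 for vi, xi, zi in zip(v, x, z))
    return total / n


def classify(x):
    """Pick the label whose condition gives the lowest energy."""
    return min(MEANS, key=lambda c: energy(x, c))


if __name__ == "__main__":
    x_src = [2.1, 0.3]
    for t0 in (1.0, 0.5, 0.0):
        steered = steer(x_src, "formal", "casual", t0)
        print(t0, steered, classify(steered))
```

In this toy, the inversion depth `t0` acts as a steering strength: `t0 = 1.0` leaves the activation unchanged, while `t0 = 0.0` shifts it by approximately the full difference between the two condition means. Whether the actual method behaves this way depends on details the abstract does not specify.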