When Is Your LLM Steerable?

Abstract

Activation steering offers a lightweight way to control a language model's behavior at inference time, but whether it succeeds or fails depends heavily on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states early in generation, e.g., after the first few tokens have been generated, and how such a predictor can be used to improve the steering success rate. To this end, we first introduce ASTEER, a testbed of 1.4M steered generations spanning 150 concepts, each labeled as a steering success or failure. Using this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how the effects of steering propagate along layers and token positions, and they provide key information for steerability prediction. We then train a gradient-boosted decision tree (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer, without requiring a full rollout. Our predictor achieves a macro-F1 score of around 0.7 on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further use this steerability predictor to guide the search over steering strength, achieving near-optimal performance at a small fraction of the decoding cost.
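To make the pipeline described above more concrete, the following is a minimal, standard-library sketch of its three ingredients: features that compare unsteered and steered hidden states over layers and early decoding steps, the macro-F1 metric over the three outcome classes, and a predictor-guided search over steering strength. Everything here is an illustrative assumption rather than the paper's implementation: the function names (`early_features`, `macro_f1`, `search_strength`), the choice of cosine similarity and relative L2 shift as features, and the bisection strategy (which assumes that outcomes move monotonically from under-steering through success to over-steering as strength grows) are not taken from the abstract.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]
LABELS = ("under", "success", "over")  # the three outcome classes named in the abstract


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def early_features(base: Sequence[Sequence[Vector]],
                   steered: Sequence[Sequence[Vector]]) -> List[float]:
    """Flatten per-(step, layer) comparisons of unsteered vs. steered states.

    base[t][l] and steered[t][l] are hidden-state vectors at early decode
    step t and layer l. For each pair we emit two hypothetical features:
    cosine similarity and the L2 shift relative to the unsteered norm.
    """
    feats: List[float] = []
    for base_step, steered_step in zip(base, steered):
        for h0, h1 in zip(base_step, steered_step):
            shift = math.sqrt(sum((b - a) ** 2 for a, b in zip(h0, h1)))
            norm0 = math.sqrt(sum(a * a for a in h0))
            feats.append(cosine(h0, h1))
            feats.append(shift / norm0 if norm0 > 0 else 0.0)
    return feats


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1 (a class with no support scores 0)."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def search_strength(predict: Callable[[float], str], lo: float, hi: float,
                    max_probes: int = 8) -> Optional[float]:
    """Bisect over steering strength using predicted labels only.

    Assumes (hypothetically) that outcomes are monotone in strength:
    under -> success -> over. Each probe costs one early-state prediction
    instead of a full rollout. Returns None if no success is found.
    """
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        label = predict(mid)
        if label == "success":
            return mid
        if label == "under":
            lo = mid  # too weak: search stronger settings
        else:
            hi = mid  # too strong: search weaker settings
    return None
```

In a fuller setup, the feature vectors would be passed to an off-the-shelf GBDT implementation, and evaluation on unseen concepts would require splitting the data by concept so that no concept appears in both training and test sets. The `predict` callback in `search_strength` stands in for such a trained classifier applied to the early hidden states at a given strength.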