When Is Your LLM Steerable?

Abstract

Activation steering offers a lightweight way to control a language model's behavior at inference time, but whether it succeeds depends heavily on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states early in generation, e.g., after the first few tokens have been generated, and how such a predictor can be used to improve the steering success rate. To this end, we first introduce ASTEER, a testbed of 1.4M steered generations spanning 150 concepts, each labeled as a steering success or failure. Using this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features show how the effects of steering propagate across layers and token positions, which provides key information for steerability prediction. We then train a Gradient Boosting Decision Tree (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer, without requiring a full rollout. Our predictor achieves a macro-F1 score of around 0.7 on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further use this steerability predictor to guide the search over steering strength, achieving near-optimal performance at a small fraction of the decoding cost.
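As a rough illustration of the pipeline described above, the following sketch compares unsteered and steered hidden states over a few layers and early decoding steps, then uses a three-way predictor to guide a search over steering strength. Everything here is an assumption made for illustration: the abstract does not specify which features are extracted or how the search is run, so the cosine and relative-shift features, the function names `early_features` and `search_strength`, the bisection strategy, and the monotonicity assumption behind it are all hypothetical. The `predict` callable stands in for a trained classifier such as the GBDT model and is not implemented here.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def relative_shift(u: Vector, v: Vector) -> float:
    """L2 norm of (v - u) divided by the norm of u (0.0 if u is zero)."""
    nu = math.sqrt(sum(a * a for a in u))
    diff = math.sqrt(sum((b - a) ** 2 for a, b in zip(u, v)))
    return diff / nu if nu > 0 else 0.0


def early_features(
    base: Sequence[Sequence[Vector]],
    steered: Sequence[Sequence[Vector]],
) -> List[float]:
    """Flatten per-(layer, step) comparisons into one feature vector.

    base[l][t] and steered[l][t] are the hidden states at layer l and
    early decoding step t, without and with steering (assumed layout).
    """
    feats: List[float] = []
    for base_layer, steered_layer in zip(base, steered):
        for h0, h1 in zip(base_layer, steered_layer):
            feats.append(cosine(h0, h1))
            feats.append(relative_shift(h0, h1))
    return feats


def search_strength(
    predict: Callable[[float], str],
    strengths: Sequence[float],
) -> Optional[float]:
    """Bisect over ascending strengths using predicted labels.

    Assumes (hypothetically) that the label moves monotonically from
    "under" through "success" to "over" as strength grows.
    Returns a strength predicted to succeed, or None if none is found.
    """
    lo, hi = 0, len(strengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        label = predict(strengths[mid])
        if label == "success":
            return strengths[mid]
        if label == "under":
            lo = mid + 1  # too weak: try larger strengths
        else:
            hi = mid - 1  # too strong: try smaller strengths
    return None


# Toy usage: 1 layer, 2 early decoding steps, hidden size 2.
base = [[[1.0, 0.0], [0.0, 1.0]]]
steered = [[[1.0, 0.0], [1.0, 1.0]]]
features = early_features(base, steered)

# Stand-in for a trained classifier that would consume such features.
toy_predict = lambda s: "under" if s < 4 else ("success" if s <= 6 else "over")
chosen = search_strength(toy_predict, [1, 2, 4, 8, 16])
```

In a real setting, each call to `predict` would decode only the first few tokens at the given strength, compute the features, and query the classifier, which is where the savings over full rollouts would come from. Bisection is just one plausible search strategy; it needs only a logarithmic number of predictor calls but breaks down if the predicted label is not monotone in strength.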