When Is Your Large Language Model Steerable?

Abstract

Activation steering offers a lightweight way to control a language model's behavior at inference time, but whether it succeeds depends heavily on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states early in generation, e.g., after the first few tokens have been generated, and how such a predictor can be used to improve the steering success rate. To this end, we first introduce ASTEER, a testbed of 1.4M steered generations spanning 150 concepts, each labeled as a steering success or failure. Using this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features show how the effects of steering propagate across layers and token positions, which provides key information for steerability prediction. We then train a gradient-boosted decision tree (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer, without requiring a full rollout. Our predictor achieves a macro-F1 score of around 0.7 on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further use this steerability predictor to guide the search over steering strength, achieving near-optimal performance at a small fraction of the decoding cost.
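To make the pipeline concrete, the following is a minimal sketch of its two ends: comparing unsteered and steered hidden states over layers and early decoding steps, and using a three-way steerability prediction to guide a strength search. It is illustrative only. The feature choices (cosine similarity and relative shift), the function names, the toy predictor standing in for a trained GBDT classifier, and the assumption that predictions move monotonically from under-steer to success to over-steer as strength grows are all hypothetical and are not taken from the abstract.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def _norm(v: Vector) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = _norm(u), _norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def early_features(
    base: Sequence[Sequence[Vector]],
    steered: Sequence[Sequence[Vector]],
) -> List[float]:
    """Flatten per-(layer, step) comparisons into one feature vector.

    base[layer][step] and steered[layer][step] are hidden-state vectors
    from the unsteered and steered runs. For each pair we emit two
    hypothetical features: cosine similarity and the norm of the
    difference relative to the unsteered norm.
    """
    feats: List[float] = []
    for layer_base, layer_steered in zip(base, steered):
        for h_base, h_steered in zip(layer_base, layer_steered):
            diff = [s - b for b, s in zip(h_base, h_steered)]
            base_norm = _norm(h_base)
            feats.append(cosine(h_base, h_steered))
            feats.append(_norm(diff) / base_norm if base_norm > 0.0 else 0.0)
    return feats


def search_strength(
    strengths: Sequence[float],
    predict: Callable[[float], str],
) -> Optional[float]:
    """Binary search over ascending strengths using predicted labels.

    predict(strength) returns "under", "success", or "over". Assumes
    (hypothetically) that labels are monotone in strength, so each call
    halves the remaining candidates instead of decoding every rollout.
    """
    lo, hi = 0, len(strengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        label = predict(strengths[mid])
        if label == "success":
            return strengths[mid]
        if label == "under":
            lo = mid + 1  # too weak: look at stronger settings
        else:
            hi = mid - 1  # too strong: look at weaker settings
    return None  # no candidate was predicted to succeed


def toy_predictor(strength: float) -> str:
    """Stand-in for a trained classifier; thresholds are made up."""
    if strength < 4.0:
        return "under"
    if strength > 8.0:
        return "over"
    return "success"
```

In a real setting, `predict` would run a few decoding steps at the candidate strength, build the feature vector with something like `early_features`, and pass it to the trained classifier. The binary search is one possible search strategy; if the monotonicity assumption does not hold for a given concept, a linear scan over the candidate strengths is the safer choice.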