LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
February 10, 2026
Authors: William Lugoloobi, Thomas Foster, William Bankes, Chris Russell
cs.AI
Abstract
Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require the additional compute remains challenging. We investigate whether a model's own likelihood of success is recoverable from its internal representations before generation, and whether this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, a dataset that provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction grows with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the accuracy of the best single model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty
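To make the setup concrete, below is a minimal sketch of the two ingredients the abstract describes: a linear probe trained on pre-generation activations, and a cost-aware router built on top of such probes. It assumes a Hugging Face causal LM and scikit-learn; the model name, probed layer, last-token pooling, toy labels, and routing threshold are all illustrative assumptions rather than the paper's configuration.

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # hypothetical choice of policy model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
lm.eval()

def pre_generation_activation(question: str, layer: int = -1) -> np.ndarray:
    # Hidden state of the last prompt token, captured before any answer tokens
    # are generated; the layer index here is an assumption, not the paper's choice.
    inputs = tok(question, return_tensors="pt")
    with torch.no_grad():
        out = lm(**inputs, output_hidden_states=True)
    # out.hidden_states holds one (1, seq_len, d_model) tensor per layer.
    return out.hidden_states[layer][0, -1].float().cpu().numpy()

# Toy training data: in practice, labels come from sampling this policy's
# answers offline and grading them against ground truth.
questions = ["Compute 2 + 2.", "Prove the Collatz conjecture."]
solved = [1, 0]  # 1 if this policy's graded answer was correct

X = np.stack([pre_generation_activation(q) for q in questions])
y = np.array(solved)
probe = LogisticRegression(max_iter=1000).fit(X, y)  # the linear probe

def route(question, probes, costs, threshold=0.5):
    # Send the query to the cheapest model whose probe predicts success above
    # `threshold`; fall back to the most expensive model if none qualifies.
    viable = []
    for name, (extract, fitted_probe) in probes.items():
        x = extract(question)[None, :]
        if fitted_probe.predict_proba(x)[0, 1] >= threshold:
            viable.append(name)
    return min(viable, key=costs.get) if viable else max(costs, key=costs.get)

In this sketch, each entry of probes pairs a per-model feature extractor with its fitted probe, and costs maps model names to per-query inference cost; the routing gains reported in the abstract come from cheap models absorbing the queries they are predicted to solve.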