LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations

Abstract

Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether a model's own likelihood of success is recoverable from its internal representations before generation, and whether this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the performance of the best-performing model while reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty
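
To make the probe-then-route idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes one linear probe per model (a weight vector and bias applied to that model's pre-generation activation, passed through a sigmoid) and a simple routing rule: send the query to the cheapest model whose predicted success probability clears a threshold, falling back to the model with the highest predicted success. The model names, probe weights, costs, threshold, and the routing rule itself are hypothetical illustrations; the paper's actual procedure may differ.

```python
import math


def probe_success_probability(activation, weights, bias):
    """Linear probe: sigmoid of a dot product over one activation vector."""
    logit = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def route(success_probs, costs, threshold=0.5):
    """Return the cheapest model whose predicted success clears the threshold.

    If no model clears it, fall back to the model with the highest
    predicted success. This rule is an assumption made for illustration.
    """
    eligible = [name for name, p in success_probs.items() if p >= threshold]
    if eligible:
        return min(eligible, key=lambda name: costs[name])
    return max(success_probs, key=success_probs.get)


# Hypothetical toy values, not taken from the paper.
COSTS = {"small": 1.0, "large": 10.0}
PROBES = {
    "small": ([2.0, -1.0], -0.5),  # (weights, bias) of the small model's probe
    "large": ([1.0, 0.5], 1.0),    # (weights, bias) of the large model's probe
}


def route_query(activations, threshold=0.5):
    """Score a query with every model's probe, then pick a model.

    `activations` maps each model name to that model's pre-generation
    activation vector for the query.
    """
    probs = {
        name: probe_success_probability(activations[name], w, b)
        for name, (w, b) in PROBES.items()
    }
    return route(probs, COSTS, threshold)


if __name__ == "__main__":
    easy = {"small": [1.0, 0.0], "large": [1.0, 0.0]}
    hard = {"small": [0.0, 1.0], "large": [0.0, 1.0]}
    print(route_query(easy), route_query(hard))
```

In practice the probe weights would be fitted on labelled (activation, success) pairs, and the threshold would be tuned to trade accuracy against cost.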
