Forecasting Future Behavior as a Learning Task

Abstract

Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods designed for single-token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operate on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM, with no human annotation, and its inference takes a single forward pass. We instantiate this approach on two tasks: predicting how likely the LRM is to repeat its answer on re-runs, and predicting how removing parts of the input changes its answer. We evaluate the approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We also find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior beyond what naive reading conveys.
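
To illustrate how training labels for the two tasks could be collected without human annotation, the following is a minimal sketch. The abstract does not specify how labels are defined, so every detail here is an assumption made for illustration: the function names `repeat_rate` and `ablation_changes_answer` are hypothetical, answers are compared by exact string match, and the effect of removing part of the input is summarized by a majority vote over re-runs.

```python
from collections import Counter
from typing import Sequence


def repeat_rate(original_answer: str, rerun_answers: Sequence[str]) -> float:
    """Fraction of re-runs whose final answer matches the original answer.

    Assumption: answers are already normalized and compared by exact match.
    """
    if not rerun_answers:
        raise ValueError("need at least one re-run to estimate a repeat rate")
    matches = sum(answer == original_answer for answer in rerun_answers)
    return matches / len(rerun_answers)


def ablation_changes_answer(
    original_answer: str, ablated_answers: Sequence[str]
) -> bool:
    """True if the most common answer on the ablated input differs.

    Assumption: `ablated_answers` come from re-running the LRM on the
    input with one part removed; the majority answer stands in for
    "the" answer on the ablated input.
    """
    if not ablated_answers:
        raise ValueError("need at least one run on the ablated input")
    majority_answer, _ = Counter(ablated_answers).most_common(1)[0]
    return majority_answer != original_answer
```

Under these assumptions, a forecaster would be trained to predict `repeat_rate` (a regression target) and `ablation_changes_answer` (a binary target) from the single original trajectory alone, so that no re-runs are needed at inference time.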