Forecasting Future Behavior as a Learnable Task

Abstract

Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods designed for single-token generations do not naturally generalize to long trajectories, and the trajectories themselves are often unfaithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task, and train Behavior Forecasters that operate on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. A forecaster's training data is obtained by querying the LRM, with no human annotation, and its inference requires a single forward pass. We instantiate this approach on two tasks: forecasting how likely the LRM is to repeat its answer on re-runs, and forecasting how removing parts of the input changes its answer. We evaluate both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We further find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior beyond what naive reading conveys.
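To make the two forecasting targets concrete, the following is a minimal sketch of how training labels for them might be derived purely by querying the LRM. The abstract does not specify the exact label definitions, so everything here is an assumption for illustration: the function names `repeat_rate` and `ablation_labels`, the exact-match comparison of answers, the one-part-at-a-time removal scheme, and the `query_lrm` callable standing in for the model.

```python
from typing import Callable, List, Sequence


def repeat_rate(original_answer: str, rerun_answers: Sequence[str]) -> float:
    """Fraction of re-runs whose final answer matches the original answer.

    Assumption: answers are compared by exact string match.
    """
    if not rerun_answers:
        raise ValueError("need at least one re-run")
    matches = sum(answer == original_answer for answer in rerun_answers)
    return matches / len(rerun_answers)


def ablation_labels(
    query_lrm: Callable[[List[str]], str],
    parts: List[str],
    original_answer: str,
) -> List[bool]:
    """For each input part, record whether removing it changes the answer.

    Assumption: `query_lrm` is a hypothetical stand-in that takes the
    remaining input parts and returns the model's final answer.
    """
    labels = []
    for i in range(len(parts)):
        reduced = parts[:i] + parts[i + 1:]  # input with part i removed
        labels.append(query_lrm(reduced) != original_answer)
    return labels


if __name__ == "__main__":
    # Toy stand-in for an LRM: "answers" with the sum of the integer parts.
    toy_lrm = lambda ps: str(sum(int(p) for p in ps))
    print(repeat_rate("42", ["42", "41", "42", "42"]))
    print(ablation_labels(toy_lrm, ["2", "3", "0"], "5"))
```

Under these assumptions, a forecaster would be trained to predict such labels from a single reasoning trajectory, so that the re-runs and ablated queries are needed only when building the training set, not at inference time.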