Interactive Benchmarks

Abstract

Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to actively acquire information is important for assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability in an interactive process under budget constraints. We instantiate this framework in two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utility. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, and reveal that there is still substantial room for improvement in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench
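
To make the idea of a budget-constrained interaction with a judge concrete, the following is a minimal toy sketch. It is not the benchmark's actual protocol or code: the `Judge` class, the `run_episode` function, and the number-guessing task are all hypothetical and chosen only for illustration. The "model" here is a simple binary-search policy that must identify a hidden integer using a limited number of yes/no queries.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Judge:
    """Hypothetical judge that holds a hidden integer and answers yes/no queries."""
    secret: int

    def answer(self, threshold: int) -> bool:
        # Answers the question "is the secret <= threshold?"
        return self.secret <= threshold


def run_episode(judge: Judge, low: int, high: int, budget: int) -> Tuple[Optional[int], int]:
    """Query the judge at most `budget` times.

    Returns (final answer, queries used). The answer is None when the
    budget runs out before the candidate range shrinks to one value.
    """
    used = 0
    while low < high and used < budget:
        mid = (low + high) // 2
        used += 1  # every query to the judge consumes one unit of budget
        if judge.answer(mid):
            high = mid
        else:
            low = mid + 1
    return (low if low == high else None), used


if __name__ == "__main__":
    judge = Judge(secret=37)
    for budget in (3, 7):
        answer, used = run_episode(judge, low=1, high=100, budget=budget)
        solved = answer == judge.secret
        print(f"budget={budget} used={used} answer={answer} solved={solved}")
```

In this toy setting, a policy that asks more informative questions solves the task within a smaller budget, which is the kind of ability an interactive evaluation is meant to expose.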