Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math

Abstract

Large language model (LLM)-based reasoning systems recently achieved gold-medal-level performance in the IMO 2025 competition, writing mathematical proofs in which, to receive full credit, each step must be not only correct but also sufficiently supported. Training LLM-based reasoners in such challenging, open-ended settings requires strong verifiers capable of catching step-level mistakes. We introduce Hard2Verify, a human-annotated, step-level verification benchmark produced with over 500 hours of human labor. Hard2Verify is designed to rigorously assess step-level verifiers at the frontier: verifiers must provide step-level annotations or identify the first error in responses generated by frontier LLMs for very recent, challenging, and open-ended math questions. We evaluate 29 generative critics and process reward models and show that, beyond a few standouts, open-source verifiers lag behind closed-source models. We then analyze what drives poor performance in step-level verification, the impact of scaling verifier compute, and fundamental questions such as self-verification and verification-generation dynamics.
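The two task formats named in the abstract, step-level annotation and first-error identification, can be illustrated with a minimal sketch. The abstract does not state the benchmark's data schema or metrics, so everything below is an assumption made for illustration: steps are labeled with booleans (`True` meaning correct), and the helper names `first_error_index`, `step_accuracy`, and `first_error_match` are hypothetical, not part of Hard2Verify.

```python
from typing import List, Optional


def first_error_index(step_labels: List[bool]) -> Optional[int]:
    """Return the 0-based index of the first incorrect step, or None if every step is correct."""
    for i, is_correct in enumerate(step_labels):
        if not is_correct:
            return i
    return None


def step_accuracy(predicted: List[bool], gold: List[bool]) -> float:
    """Fraction of steps where the verifier's label agrees with the human label.

    Plain accuracy is an illustrative choice, not the benchmark's stated metric.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same steps")
    if not gold:
        raise ValueError("a response must contain at least one step")
    agreements = sum(p == g for p, g in zip(predicted, gold))
    return agreements / len(gold)


def first_error_match(predicted: List[bool], gold: List[bool]) -> bool:
    """True when the verifier locates the first error at the same step as the human annotator."""
    return first_error_index(predicted) == first_error_index(gold)


# Hypothetical four-step response: humans mark step 2 as the first error,
# while the verifier flags step 1 too early.
gold_labels = [True, True, False, True]
verifier_labels = [True, False, False, True]

accuracy = step_accuracy(verifier_labels, gold_labels)
same_first_error = first_error_match(verifier_labels, gold_labels)
```

Under these assumptions, a verifier can agree with most step labels and still fail first-error identification, which is why the two formats are worth scoring separately.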
