Heimdall: test-time scaling on the generative verification

Abstract

An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself. Recent work on long Chain-of-Thought (CoT) reasoning has demonstrated the great potential of LLMs in solving competitive problems, but their verification ability remains weak and insufficiently investigated. In this paper, we propose Heimdall, a long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost the verification accuracy from 62.5% to 94.5% on competitive math problems. By scaling with repeated sampling, the accuracy further increases to 97.5%. In human evaluation, Heimdall demonstrates impressive generalization, successfully detecting most issues in challenging math proofs, a type of problem not included during training. Furthermore, we propose Pessimistic Verification, which extends Heimdall to scale up problem solving. It calls Heimdall to judge the solutions from a solver model and, based on the pessimistic principle, selects the solution most likely to be correct with the least uncertainty. Taking DeepSeek-R1-Distill-Qwen-32B as the solver model, Pessimistic Verification improves the solution accuracy on AIME2025 from 54.2% to 70.0% with a 16x compute budget and to 83.3% with a larger compute budget. With the stronger solver Gemini 2.5 Pro, the score reaches 93.0%. Finally, we prototype an automatic knowledge discovery system, a ternary system in which one component poses questions, another provides solutions, and the third verifies the solutions. Using the data synthesis work NuminaMath for the first two components, Heimdall effectively identifies problematic records within the dataset and reveals that nearly half of the data is flawed, which interestingly aligns with recent ablation studies from NuminaMath.
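
The abstract does not give the scoring rule behind Pessimistic Verification, so the following is only a minimal sketch of the general idea, not the paper's formula. It assumes that each sampled solution receives several binary verifier verdicts, that solutions are grouped by their final answer, and that each answer is scored by its mean verdict minus an uncertainty penalty that shrinks as more verdicts accumulate. The function name `pessimistic_select`, the parameter `alpha`, and the square-root penalty are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def pessimistic_select(candidates, alpha=1.0):
    """Pick an answer by a pessimistic (lower-bound style) score.

    candidates: list of (answer, verdicts) pairs, where verdicts is a list
    of 0/1 judgments from repeated verifier calls on one sampled solution.
    alpha: hypothetical weight on the uncertainty penalty (an assumption).
    """
    # Pool the verdicts of all solutions that reach the same final answer.
    grouped = defaultdict(list)
    for answer, verdicts in candidates:
        grouped[answer].extend(verdicts)

    total = sum(len(v) for v in grouped.values())

    def score(verdicts):
        n = len(verdicts)
        if n == 0:
            # An answer with no verdicts is maximally uncertain.
            return float("-inf")
        mean = sum(verdicts) / n
        # Assumed penalty: large when few verdicts back the answer.
        penalty = alpha * math.sqrt(math.log(total + 1) / n)
        return mean - penalty

    # Raises ValueError if candidates is empty.
    return max(grouped, key=lambda a: score(grouped[a]))


if __name__ == "__main__":
    sampled = [
        ("42", [1, 1, 1, 1]),
        ("42", [1, 1, 1, 0]),
        ("17", [1]),
    ]
    print(pessimistic_select(sampled))
```

With `alpha=0.0` the rule reduces to picking the answer with the highest mean verdict, which would favor an answer backed by a single positive judgment. A positive `alpha` prefers answers whose support rests on more verification evidence.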
