TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
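To make the role of a PRM in test-time scaling concrete, the following is a minimal sketch of Best-of-N selection driven by step-level rewards. It is an illustration only: the abstract does not specify which TTS strategies or score aggregation are used, so Best-of-N, the min-over-steps aggregation, and every name below (`best_of_n`, `aggregate_min`, `toy_score_step`) are assumptions, and the toy scorer merely stands in for a trained PRM.

```python
from typing import Callable, List, Sequence

Step = str
Trajectory = List[Step]


def aggregate_min(step_scores: Sequence[float]) -> float:
    """Score a trajectory by its weakest step (one common, assumed choice)."""
    return min(step_scores) if step_scores else 0.0


def best_of_n(
    candidates: Sequence[Trajectory],
    score_step: Callable[[Step], float],
    aggregate: Callable[[Sequence[float]], float] = aggregate_min,
) -> Trajectory:
    """Return the candidate trajectory with the highest aggregated step reward."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    scores = [aggregate([score_step(s) for s in traj]) for traj in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


def toy_score_step(step: Step) -> float:
    """Hypothetical stand-in for a PRM: penalize steps that admit guessing."""
    return 0.1 if "guess" in step else 0.9


# Two sampled reasoning trajectories over the same (imaginary) table.
candidates = [
    ["select rows where year == 2020", "guess the total is 40"],
    ["select rows where year == 2020", "sum the revenue column: 12 + 30 = 42"],
]
best = best_of_n(candidates, toy_score_step)
```

In this sketch the policy model samples several trajectories, the reward model scores each step, and the trajectory whose weakest step scores highest is kept. A tool-grounded PRM would replace `toy_score_step` with a scorer that can execute table operations before assigning a reward.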
