TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
October 7, 2025
Authors: Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
cs.AI
Abstract
Process Reward Models (PRMs) have recently emerged as a powerful framework
for enhancing the reasoning capabilities of large reasoning models (LRMs),
particularly in the context of test-time scaling (TTS). However, their
potential for supervising LRMs on tabular reasoning domains remains
underexplored. Through detailed empirical analyses, we identify that existing
PRMs, though widely adopted for supervising text-only reasoning steps, struggle
with table-specific operations such as sub-table retrieval and schema
interaction, leading to critical performance bottlenecks. To address this
limitation, we propose TaTToo, a novel table-grounded PRM framework that (i)
reasons explicitly over tabular reasoning steps and (ii) integrates tool-based
verification to provide precise reward supervision. Concretely, we first design
a scalable data curation pipeline that constructs over 60k high-quality
step-level annotations by integrating table verification rationales with
tool-based executions. Building on the collected data, we train TaTToo with a
dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use
reasoning patterns, followed by reinforcement learning with tool-grounded
reward shaping to align our model with table-based verification. We provide a
comprehensive evaluation of the policy improvement induced by our newly
designed PRM. Across 5 challenging tabular reasoning benchmarks covering
numerical reasoning, fact-checking, and data analysis, TaTToo improves
downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines
such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong
generalizability across diverse TTS strategies.
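The test-time-scaling setup the abstract describes (a step-level PRM scoring candidate reasoning chains so the best one can be selected) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `prm_score_step` is a hypothetical stand-in for the learned reward model, and the `[verified]` marker simulates tool-grounded verification of a step.

```python
def prm_score_step(step: str) -> float:
    """Hypothetical stand-in for a step-level PRM. Here a step that carries
    a (simulated) tool-verification marker gets a higher reward; a real PRM
    would be a learned model scoring each reasoning step."""
    return 1.0 if "[verified]" in step else 0.5

def aggregate(step_scores):
    # Common aggregation choices are min or mean over step rewards;
    # min penalizes a single weak step most strongly.
    return min(step_scores)

def best_of_n(candidate_chains):
    """Best-of-N selection: score every candidate chain (a list of reasoning
    steps) with the PRM and return the highest-scoring chain."""
    return max(candidate_chains,
               key=lambda chain: aggregate([prm_score_step(s) for s in chain]))

# Two sampled chains for an illustrative table question; only the second
# has every step tool-verified, so the PRM prefers it.
chains = [
    ["retrieve sub-table", "sum column", "answer: 42"],
    ["retrieve sub-table [verified]", "sum column [verified]",
     "answer: 40 [verified]"],
]
print(best_of_n(chains)[-1])  # prints the final step of the verified chain
```

In practice the policy LRM would generate the N candidate chains and the PRM would score them; the aggregation rule (min vs. mean) is a design choice that trades robustness against leniency toward isolated weak steps.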