TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
October 7, 2025
Authors: Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
cs.AI
Abstract
Process Reward Models (PRMs) have recently emerged as a powerful framework
for enhancing the reasoning capabilities of large reasoning models (LRMs),
particularly in the context of test-time scaling (TTS). However, their
potential for supervising LRMs on tabular reasoning domains remains
underexplored. Through detailed empirical analyses, we identify that existing
PRMs, though widely adopted for supervising text-only reasoning steps, struggle
with table-specific operations such as sub-table retrieval and schema
interaction, leading to critical performance bottlenecks. To address this
limitation, we propose TaTToo, a novel table-grounded PRM framework that (i)
reasons explicitly over tabular reasoning steps and (ii) integrates tool-based
verification to provide precise reward supervision. Concretely, we first design
a scalable data curation pipeline that constructs over 60k high-quality
step-level annotations by integrating table verification rationales with
tool-based executions. Building on the collected data, we train TaTToo with a
dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use
reasoning patterns, followed by reinforcement learning with tool-grounded
reward shaping to align our model with table-based verification. We provide a
comprehensive evaluation of the policy improvement induced by our newly
designed PRM. Across 5 challenging tabular reasoning benchmarks covering
numerical reasoning, fact-checking, and data analysis, TaTToo improves
downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines
such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong
generalizability across diverse TTS strategies.
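The test-time-scaling setup the abstract describes (a step-level PRM scoring candidate reasoning chains so the best one can be selected) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `prm_score_step` is a hypothetical stand-in for the learned reward model, and the `[verified]` marker simulates tool-grounded verification of a step.

```python
def prm_score_step(step: str) -> float:
    """Hypothetical stand-in for a step-level PRM. Here a step that carries
    a (simulated) tool-verification marker gets a higher reward; a real PRM
    would be a learned model scoring each reasoning step."""
    return 1.0 if "[verified]" in step else 0.5

def aggregate(step_scores):
    # Common aggregation choices are min or mean over step rewards;
    # min penalizes a single weak step most strongly.
    return min(step_scores)

def best_of_n(candidate_chains):
    """Best-of-N selection: score every candidate chain (a list of reasoning
    steps) with the PRM and return the highest-scoring chain."""
    return max(candidate_chains,
               key=lambda chain: aggregate([prm_score_step(s) for s in chain]))

# Two sampled chains for an illustrative table question; only the second
# has every step tool-verified, so the PRM prefers it.
chains = [
    ["retrieve sub-table", "sum column", "answer: 42"],
    ["retrieve sub-table [verified]", "sum column [verified]",
     "answer: 40 [verified]"],
]
print(best_of_n(chains)[-1])  # prints the final step of the verified chain
```

In practice the policy LRM would generate the N candidate chains and the PRM would score them; the aggregation rule (min vs. mean) is a design choice that trades robustness against leniency toward isolated weak steps.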