TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

October 7, 2025
作者: Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
cs.AI

Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
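The test-time scaling setup the abstract describes can be illustrated with a minimal sketch of PRM-guided best-of-N selection, one common TTS strategy: a process reward model scores each reasoning step of every candidate trace, and the trace with the best aggregated score is kept. All names here (`prm_score`, `best_of_n`, the length-based toy scorer) are illustrative assumptions, not TaTToo's actual API.

```python
# Sketch of PRM-guided best-of-N test-time scaling (illustrative only).
# A real PRM is a learned model; prm_score below is a toy stand-in.

def prm_score(steps):
    """Score each reasoning step in [0, 1] and aggregate with min,
    a common PRM aggregation (a trace is only as good as its worst step)."""
    # Toy heuristic: longer steps get higher scores, capped at 1.0.
    step_scores = [min(1.0, len(s) / 40.0) for s in steps]
    return min(step_scores)

def best_of_n(candidates):
    """Return the candidate trace with the highest PRM score."""
    return max(candidates, key=prm_score)

if __name__ == "__main__":
    traces = [
        ["Retrieve the sub-table for 2021.", "Sum the revenue column.", "Answer: 42"],
        ["Guess.", "Answer: 7"],
    ]
    print(best_of_n(traces)[-1])
```

In TaTToo's setting, the step scorer would additionally invoke tools (e.g., executing a sub-table lookup) to verify each step before assigning a reward, rather than judging the text alone.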