TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

October 7, 2025
Authors: Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
cs.AI

Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
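
As a rough illustration of how a PRM can drive test-time scaling, the sketch below shows Best-of-N selection, one common TTS strategy: a policy samples several candidate reasoning trajectories, a step-level scorer rates each step, and the highest-scoring trajectory is kept. This is not the TaTToo implementation. The names `best_of_n`, `aggregate_min`, and `toy_score_step` are hypothetical, the choice of the minimum step score as the trajectory score is an assumption (the abstract does not state an aggregation rule), and the toy scorer only stands in for a trained, tool-grounded PRM.

```python
from typing import Callable, List, Sequence

Step = str
Trajectory = List[Step]
# A step scorer receives the steps so far and the current step, and returns a reward.
StepScorer = Callable[[Sequence[Step], Step], float]


def aggregate_min(step_scores: Sequence[float]) -> float:
    """Score a trajectory by its weakest step (an assumed aggregation rule)."""
    if not step_scores:
        raise ValueError("a trajectory needs at least one step")
    return min(step_scores)


def best_of_n(
    candidates: Sequence[Trajectory],
    score_step: StepScorer,
    aggregate: Callable[[Sequence[float]], float] = aggregate_min,
) -> Trajectory:
    """Return the candidate trajectory with the highest aggregated step reward."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")

    def trajectory_score(trajectory: Trajectory) -> float:
        # Score every step given the prefix of steps that precede it.
        scores = [
            score_step(trajectory[:i], trajectory[i])
            for i in range(len(trajectory))
        ]
        return aggregate(scores)

    return max(candidates, key=trajectory_score)


def toy_score_step(prefix: Sequence[Step], step: Step) -> float:
    """Hypothetical stand-in for a PRM: penalize steps marked as unverified."""
    return 0.2 if "[unverified]" in step else 0.9
```

In this sketch, a trajectory containing a step tagged `[unverified]` would lose to one whose steps all pass the toy check; a real PRM would replace `toy_score_step` with model-based, tool-verified step rewards.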