TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs in tabular reasoning domains remains underexplored. Through detailed empirical analyses, we find that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align the model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by the newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and generalizes well across diverse TTS strategies.