TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
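
To make the role of a PRM in test-time scaling concrete, the following is a minimal sketch of Best-of-N selection, one common TTS strategy; the abstract does not specify which strategies were evaluated, so this choice is an assumption. It is not TaTToo's implementation: `score_step`, `toy_scorer`, `aggregate`, and `best_of_n` are hypothetical names, and the toy scorer merely stands in for a trained PRM that would return a reward for each reasoning step.

```python
from typing import Callable, List, Sequence

# Hypothetical signature for a step-level reward model:
# (question, all steps of one candidate, index of the step to score) -> reward.
StepScorer = Callable[[str, Sequence[str], int], float]


def aggregate(step_rewards: Sequence[float], how: str = "min") -> float:
    """Collapse per-step rewards into a single score for one candidate."""
    if not step_rewards:
        return 0.0
    if how == "min":
        return min(step_rewards)
    if how == "mean":
        return sum(step_rewards) / len(step_rewards)
    if how == "last":
        return step_rewards[-1]
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(
    question: str,
    candidates: Sequence[Sequence[str]],
    score_step: StepScorer,
    how: str = "min",
) -> int:
    """Return the index of the candidate with the highest aggregated reward."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores: List[float] = [
        aggregate([score_step(question, c, i) for i in range(len(c))], how)
        for c in candidates
    ]
    return max(range(len(candidates)), key=scores.__getitem__)


def toy_scorer(question: str, steps: Sequence[str], i: int) -> float:
    """Stand-in for a real PRM (assumption): penalize steps that admit guessing."""
    return 0.2 if "guess" in steps[i] else 0.9


candidates = [
    ["retrieve the 2020 row", "guess the total"],
    ["retrieve the 2020 row", "sum the revenue column"],
]
chosen = best_of_n("What was total revenue in 2020?", candidates, toy_scorer)
print(chosen)
```

In this sketch the policy model samples N candidate responses and the reward model only ranks them. Taking the minimum over steps means a single poorly scored step sinks the whole candidate; the mean or the last-step reward are alternative aggregations.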
