Table-R1: Inference-Time Scaling for Table Reasoning

Abstract

In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1 while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as the emergence of essential table reasoning skills during RL training.
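
To make the RLVR idea above more concrete, here is a minimal sketch of what a verifiable reward for short-form QA and a GRPO-style group-relative advantage might look like. The abstract does not specify the actual reward functions, so everything here is an assumption for illustration: the names `normalize`, `exact_match_reward`, and `group_relative_advantages` are hypothetical, the exact-match rule stands in for whatever task-specific check the authors use, and the population standard deviation is one possible choice of normalizer.

```python
import re
import statistics


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and trim spaces and periods at the edges."""
    return re.sub(r"\s+", " ", text.strip().lower()).strip(" .")


def exact_match_reward(prediction: str, gold: str) -> float:
    """Hypothetical verifiable reward: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of sampled answers (the core of GRPO).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to the same table question, scored against one gold answer.
samples = ["Paris", "paris.", "London", "Berlin"]
rewards = [exact_match_reward(s, "Paris") for s in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch, answers that pass the check receive a positive advantage and the others a negative one, so the policy update favors verifiably correct responses without needing a learned reward model.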
