How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning

Abstract

Recent breakthroughs in large language models (LLMs) have substantially improved their reasoning abilities, particularly on mathematical and logical problems with verifiable answers, through techniques such as supervised fine-tuning (SFT) and reinforcement learning (RL). Prior research indicates that RL effectively internalizes search strategies, enabling long chain-of-thought (CoT) reasoning, with backtracking emerging naturally as a learned capability. However, the precise benefits of backtracking, specifically how much it contributes to reasoning improvements and how much of it is optimal, remain poorly understood. In this work, we systematically investigate the dynamics between SFT and RL on eight reasoning tasks: Countdown, Sudoku, Arc 1D, Geometry, Color Cube Rotation, List Functions, Zebra Puzzles, and Self Reference. Our findings show that short CoT sequences used in SFT as a warm-up contribute moderately to RL training compared with cold-start RL; however, this contribution diminishes as tasks become more difficult. Motivated by this observation, we construct synthetic datasets that vary systematically in the number of backtracking steps and conduct controlled experiments to isolate the influence of correctness (content) from that of structure (i.e., backtrack frequency). We find that (1) longer CoTs with backtracks generally induce better and more stable RL training, and (2) more challenging problems with larger search spaces tend to need more backtracks during the SFT stage. Additionally, we demonstrate through experiments on distilled data that RL training is largely unaffected by the correctness of long CoT sequences, suggesting that RL prioritizes structural patterns over content correctness. Collectively, our results offer practical insights into designing optimal training strategies to effectively scale reasoning in LLMs.
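
To make the idea of "varying the number of backtracking steps" concrete, the following is a minimal sketch of how a Countdown-style reasoning trace with exactly k backtracks could be assembled. It is an illustration under stated assumptions, not the authors' data pipeline: the function names (moves, solve, trace_with_backtracks), the trace format, and the simplified rules (only +, -, *, every number used exactly once) are all hypothetical.

```python
from itertools import combinations

# Simplified operator set (assumption: no division, so all values stay integers).
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def moves(nums):
    """Yield (step_text, next_numbers) for every way to combine two numbers."""
    for i, j in combinations(range(len(nums)), 2):
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        hi, lo = max(nums[i], nums[j]), min(nums[i], nums[j])
        for sym, fn in OPS.items():
            out = fn(hi, lo)  # hi - lo is never negative
            yield f"{hi}{sym}{lo}={out}", rest + [out]


def solve(nums, target):
    """Depth-first search; return the list of steps reaching target, or None."""
    if len(nums) == 1:
        return [] if nums[0] == target else None
    for step, nxt in moves(nums):
        tail = solve(nxt, target)
        if tail is not None:
            return [step] + tail
    return None


def trace_with_backtracks(nums, target, k):
    """Build a trace with exactly k dead-end first moves, then a correct solution.

    Returns None if the puzzle is unsolvable or has fewer than k dead ends.
    """
    solution = solve(nums, target)
    if solution is None:
        return None
    # A first move is a dead end if the target is unreachable after taking it.
    dead_ends = [s for s, nxt in moves(nums) if solve(nxt, target) is None]
    if len(dead_ends) < k:
        return None
    trace = []
    for step in dead_ends[:k]:
        trace += [step, "backtrack"]
    return trace + solution


if __name__ == "__main__":
    for line in trace_with_backtracks([3, 5, 7], 22, k=2):
        print(line)
```

In this sketch the structure of a trace (how many "backtrack" markers it contains) is controlled by k independently of its content, which mirrors the kind of separation between structure and correctness that the controlled experiments are described as targeting.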

