Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

Abstract

Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. At the limits of model capability, however, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to enable more effective exploration of solutions. Although DAC is promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model's ability to exploit this potential. To bridge this gap and unlock LLMs' reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework that enhances their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and then addresses the original problem conditioned on the subproblem solutions; both decomposition and solution are integrated into RL training. Under comparable training conditions, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.
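The decompose, solve, and combine loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Policy` callable, the prompt wording, the one-subproblem-per-line format, and the `max_subproblems` cap are all hypothetical choices made for illustration. The RL training signal is not shown. The sketch also assumes that each subproblem sees the solutions of earlier ones, which is one possible reading of "solves them sequentially".

```python
from typing import Callable, Dict, List

# Hypothetical interface: a policy maps a prompt string to a completion string.
Policy = Callable[[str], str]


def dac_solve(problem: str, policy: Policy, max_subproblems: int = 4) -> Dict[str, object]:
    """Run one divide-and-conquer rollout with a single policy."""
    # Divide: ask the policy for subproblems, one per line (assumed format).
    raw = policy(f"Decompose into subproblems, one per line:\n{problem}")
    subproblems = [ln.strip() for ln in raw.splitlines() if ln.strip()][:max_subproblems]

    # Conquer: solve subproblems in order; each prompt carries earlier solutions.
    context: List[str] = []
    for sub in subproblems:
        answer = policy("\n".join(context + [f"Solve: {sub}"]))
        context.append(f"Subproblem: {sub}\nSolution: {answer}")

    # Combine: solve the original problem conditioned on the subproblem solutions.
    final = policy("\n".join(context + [f"Using the results above, solve: {problem}"]))
    return {"subproblems": subproblems, "context": context, "final": final}


def toy_policy(prompt: str) -> str:
    """Stand-in for an LLM: canned answers keyed on the prompt (illustration only)."""
    if prompt.startswith("Decompose"):
        return "compute 2+3\ncompute 5*4"
    last_line = prompt.splitlines()[-1]
    if last_line == "Solve: compute 2+3":
        return "5"
    return "20"


result = dac_solve("compute (2+3)*4", toy_policy)
print(result["subproblems"], result["final"])
```

In an actual system the same model would presumably play both the decomposer and solver roles, and a reward on the final answer would update both behaviors; here `toy_policy` only stands in for that model so the control flow can be run end to end.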
