Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability
February 2, 2026
Authors: Xiao Liang, Zhong-Zhi Li, Zhenghao Lin, Eric Hancheng Jiang, Hengyuan Zhang, Yelong Shen, Kai-Wei Chang, Ying Nian Wu, Yeyun Gong, Weizhu Chen
cs.AI
Abstract
Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution space. Although this direction is promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model's capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs' reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original problem conditioned on the subproblem solutions, with both the decomposition and the solving steps integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.
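A minimal sketch of the decompose-solve-conquer loop described above, assuming a generic LLM completion call. This is an illustrative reading of DAC-style inference, not the authors' implementation: the names `generate`, `decompose`, `dac_solve`, the prompt wording, and the `max_depth` cutoff are all assumptions, and the RL training that optimizes both the decomposition and the solving stages is not shown.

```python
# Illustrative sketch of divide-and-conquer (DAC) style inference.
# `generate` stands for any LLM text-completion call supplied by the caller;
# all prompts and helper names here are hypothetical.

from typing import Callable, List


def decompose(problem: str, generate: Callable[[str], str]) -> List[str]:
    """Ask the model to split a problem into subproblems, one per line."""
    out = generate(
        "Decompose the following problem into simpler subproblems, "
        f"one per line:\n{problem}"
    )
    return [line.strip() for line in out.splitlines() if line.strip()]


def dac_solve(problem: str, generate: Callable[[str], str], max_depth: int = 2) -> str:
    """Decompose, solve the subproblems in order, then answer the original
    problem conditioned on the subproblem solutions."""
    if max_depth == 0:
        # Base case: fall back to direct, CoT-style solving.
        return generate(f"Solve step by step:\n{problem}")

    subproblems = decompose(problem, generate)
    # Solve subproblems sequentially; each may itself be decomposed further.
    sub_solutions = [dac_solve(sp, generate, max_depth - 1) for sp in subproblems]

    context = "\n".join(
        f"Subproblem: {sp}\nSolution: {sol}"
        for sp, sol in zip(subproblems, sub_solutions)
    )
    # "Conquer" step: solve the original problem given the subproblem solutions.
    return generate(
        "Using the solved subproblems below, solve the original problem.\n"
        f"{context}\n\nOriginal problem: {problem}"
    )
```

In the framework described by the abstract, both stages of such a loop (how the problem is decomposed and how each piece is solved) would be produced by the policy and trained end-to-end with RL, whereas this sketch only shows the inference-time structure.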