Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning
January 31, 2026
Authors: Zhipeng Chen, Xiaobo Qin, Wayne Xin Zhao, Youbin Wu, Ji-Rong Wen
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has shown great potential to enhance the reasoning ability of large language models (LLMs). However, due to the limited amount of information provided during the RLVR process, the model can only engage in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A^2D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A^2D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we conduct an analysis of the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner's exploration and exploitation abilities.
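The abstract describes a two-stage pipeline: an RLVR-trained decomposer annotates each training question with simpler sub-questions, and a reasoner is then trained with RLVR under that guidance. The sketch below is a minimal illustration of this data flow, not the paper's implementation; ToyModel, rlvr_train, and both reward functions are hypothetical placeholders.

```python
# A minimal, runnable sketch of the two-stage A^2D pipeline described in the
# abstract. Every name below (ToyModel, rlvr_train, the reward functions, the
# toy data) is a hypothetical placeholder, not the authors' implementation.
from dataclasses import dataclass, field

@dataclass
class Question:
    text: str
    answer: str
    sub_questions: list[str] = field(default_factory=list)

class ToyModel:
    """Stand-in for an LLM policy; the paper trains LLMs with RLVR."""
    def generate(self, prompt: str) -> str:
        return f"answer to: {prompt}"

    def decompose(self, question: str) -> list[str]:
        return [f"step 1 of: {question}", f"step 2 of: {question}"]

def rlvr_train(model: ToyModel, dataset: list[Question], reward_fn) -> ToyModel:
    """Placeholder RLVR loop: roll out, score with a verifiable reward,
    then (in a real system) apply a policy-gradient update (e.g., GRPO/PPO)."""
    for q in dataset:
        # If sub-questions were annotated, inject them as guidance.
        prompt = q.text
        if q.sub_questions:
            prompt += "\nSub-questions: " + "; ".join(q.sub_questions)
        rollout = model.generate(prompt)
        _reward = reward_fn(q, rollout)  # verifiable reward, no teacher model
        # ... policy update would go here ...
    return model

def decomposition_reward(q: Question, rollout: str) -> float:
    # Simplified: reward the decomposer when its output addresses the question.
    return 1.0 if q.text in rollout else 0.0

def answer_reward(q: Question, rollout: str) -> float:
    # Simplified: reward the reasoner for a verifiably correct final answer.
    return 1.0 if q.answer in rollout else 0.0

train_set = [Question("Compute 2 + 2.", "4"), Question("Compute 3 * 5.", "15")]

# Stage 1: train the decomposer via RLVR, without distillation.
decomposer = rlvr_train(ToyModel(), train_set, decomposition_reward)

# Stage 2a: annotate each training question with sub-questions.
for q in train_set:
    q.sub_questions = decomposer.decompose(q.text)

# Stage 2b: train the reasoner via RLVR under sub-question guidance.
reasoner = rlvr_train(ToyModel(), train_set, answer_reward)
```

The key structural point the sketch captures is that the decomposer supplies extra information (sub-questions in the prompt) to the reasoner's RLVR stage, replacing the "blind exploration" the abstract criticizes, without requiring a teacher model.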