
Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning

January 31, 2026
Authors: Zhipeng Chen, Xiaobo Qin, Wayne Xin Zhao, Youbin Wu, Ji-Rong Wen
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown great potential to enhance the reasoning ability of large language models (LLMs). However, due to the limited amount of information provided during the RLVR process, the model can only engage in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A^2D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A^2D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we conduct an analysis of the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner's exploration and exploitation abilities.
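The abstract outlines a two-stage training pipeline: RLVR for a decomposer, then RLVR for a reasoner guided by the decomposer's sub-questions. The sketch below illustrates that control flow only; every name in it (ToyPolicy, train_rlvr, answer_reward, the prompt format) is a hypothetical stand-in, since the page gives no implementation details.

```python
"""Minimal, self-contained sketch of the two-stage A^2D pipeline described
above. All names here are hypothetical illustrations; the paper does not
publish code on this page."""

from typing import Callable, List, Tuple

Dataset = List[Tuple[str, str]]  # (question, verifiable final answer)


class ToyPolicy:
    """Stand-in for an LLM policy; A^2D would wrap a real model."""

    def generate(self, prompt: str, n: int = 4) -> List[str]:
        # Placeholder rollouts; a real policy samples n reasoning traces.
        return [f"trace {i} for: {prompt[:40]}" for i in range(n)]

    def update(self, rollouts: List[str], rewards: List[float]) -> None:
        # Placeholder policy update (a GRPO/PPO-style step in practice).
        pass


def train_rlvr(policy: ToyPolicy, data: Dataset,
               reward_fn: Callable[[str, str], float]) -> ToyPolicy:
    """Generic RLVR loop: sample rollouts, score them with a verifiable
    reward, and update the policy."""
    for question, answer in data:
        rollouts = policy.generate(question)
        rewards = [reward_fn(r, answer) for r in rollouts]
        policy.update(rollouts, rewards)
    return policy


def answer_reward(rollout: str, answer: str) -> float:
    """Verifiable reward: 1 if the rollout contains the gold answer."""
    return 1.0 if answer in rollout else 0.0


train_set: Dataset = [("What is 12 * 13 - 7?", "149")]

# Stage 1: train the decomposer with RLVR, without teacher distillation.
decomposer = train_rlvr(ToyPolicy(), train_set, answer_reward)

# Stage 2: annotate each question with generated sub-questions, then
# train the reasoner with RLVR under that sub-question guidance.
guided: Dataset = [
    (q + "\nSub-questions: " + decomposer.generate(q, n=1)[0], a)
    for q, a in train_set
]
reasoner = train_rlvr(ToyPolicy(), guided, answer_reward)
```

In a real implementation the decomposer's reward would need to verify that its sub-questions actually lead to the checked final answer, and the update step would be a full policy-gradient algorithm; both are abstracted away here.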