Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning

Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown great potential for enhancing the reasoning ability of large language models (LLMs). However, because the RLVR process provides only limited information, the model is largely restricted to blind exploration, which often fails on challenging problems. To supply additional information to the RLVR process without relying on a teacher model, we propose A^2D, an Adaptive Ability Decomposing method for making RLVR more effective. Specifically, we first train a decomposer via RLVR, without distillation, to decompose complex questions into a set of simpler sub-questions. We then use this decomposer to annotate each question in the training dataset with sub-questions, and train the reasoner under RLVR with sub-question guidance. To better understand A^2D, we first compare it with competitive baselines and demonstrate its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Finally, we analyze the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited to enhancing the reasoner's exploration and exploitation abilities.
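
As a rough illustration of the two ingredients the abstract names, the sketch below shows a binary verifiable reward and a prompt that carries sub-question guidance. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `verifiable_reward` and `build_guided_prompt`, the exact-match answer check, and the prompt layout are all hypothetical choices made for this example.

```python
from typing import List


def verifiable_reward(predicted: str, reference: str) -> float:
    """Binary reward: 1.0 if the normalized answers match, else 0.0.

    Assumption: exact string match after trimming and lowercasing stands in
    for whatever answer checker an actual RLVR setup would use.
    """
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def build_guided_prompt(question: str, sub_questions: List[str]) -> str:
    """Attach decomposer-produced sub-questions to the question as guidance.

    Assumption: the numbered-hint layout is illustrative only.
    """
    if not sub_questions:
        # No guidance available: fall back to the plain question.
        return question
    hints = "\n".join(f"{i}. {sq}" for i, sq in enumerate(sub_questions, start=1))
    return f"{question}\n\nSub-questions to consider:\n{hints}"
```

In this sketch the reasoner would be sampled on the output of `build_guided_prompt` and scored with `verifiable_reward`; how the guidance is actually scheduled or weighted during training is not specified in the abstract and is left out here.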
