Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning

Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown great potential for enhancing the reasoning ability of large language models (LLMs). However, because the RLVR process provides only limited information, the model can engage only in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A^2D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A^2D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we conduct an analysis of the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner's exploration and exploitation abilities.
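The second stage described above (annotating each training question with sub-questions, then training the reasoner with that guidance) can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the `Example` container, the prompt layout in `guided_prompt`, the exact-match `verifiable_reward`, and the `toy_decomposer` placeholder that stands in for the RLVR-trained decomposer. The abstract does not specify how guidance is injected into the prompt or how rewards are computed, so this should not be read as the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Example:
    """One training item; the layout is hypothetical, not taken from the paper."""
    question: str
    answer: str
    sub_questions: List[str] = field(default_factory=list)


def annotate(dataset: List[Example],
             decompose: Callable[[str], List[str]]) -> List[Example]:
    """Attach decomposer-produced sub-questions to every training question."""
    return [
        Example(ex.question, ex.answer, list(decompose(ex.question)))
        for ex in dataset
    ]


def guided_prompt(ex: Example) -> str:
    """Build a reasoner prompt that lists the sub-questions as guidance.

    The prompt format is an assumption; without sub-questions the
    original question is returned unchanged.
    """
    if not ex.sub_questions:
        return ex.question
    hints = "\n".join(
        f"{i}. {sq}" for i, sq in enumerate(ex.sub_questions, start=1)
    )
    return f"{ex.question}\n\nSub-questions to consider:\n{hints}"


def verifiable_reward(predicted: str, ex: Example) -> float:
    """Binary reward from exact match against the reference answer (assumed)."""
    return 1.0 if predicted.strip() == ex.answer.strip() else 0.0


def toy_decomposer(question: str) -> List[str]:
    """Placeholder for the trained decomposer; returns fixed generic hints."""
    return [
        f"What quantities are given in: {question}",
        "Which operation combines them?",
    ]


if __name__ == "__main__":
    data = [Example("What is 2 + 3?", "5")]
    annotated = annotate(data, toy_decomposer)
    print(guided_prompt(annotated[0]))
    print(verifiable_reward("5", annotated[0]))
```

In a real setup, the reward would feed whichever RLVR algorithm is being used; the plug-and-play claim in the abstract suggests that only the prompt construction step needs to change, though the sketch above does not verify that.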
