Self-Distilled RLVR

Abstract

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. It uses a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which obtains only sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), in which the same model serves as both teacher and student, and the teacher receives additional privileged information, such as reference answers, to enable self-evolution. This paper demonstrates that learning signals derived solely from the privileged teacher lead to severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we use self-distillation to obtain token-level policy differences that determine fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This allows RLSD to harness the strengths of both RLVR and OPSD simultaneously, achieving a higher convergence ceiling and superior training stability.
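The division of labor described in the abstract (direction from the verifiable reward, magnitude from the token-level gap between teacher and student) can be illustrated with a small sketch. This is only one possible reading, not the formulation used in the paper: the function name `rlsd_token_weights`, the use of absolute log-probability gaps, the mean normalization, the `baseline` term, and the `clip` value are all assumptions made for illustration.

```python
def rlsd_token_weights(student_logps, teacher_logps, reward, baseline=0.0, clip=2.0):
    """Hypothetical per-token update weights (illustrative assumption only).

    Direction: the sign of (reward - baseline), taken from the verifiable outcome.
    Magnitude: the per-token gap between the privileged teacher's and the
    student's log-probabilities, normalized by the mean gap and clipped.
    """
    # Sequence-level signal from the environment (e.g., answer correctness).
    advantage = reward - baseline

    # Token-level policy difference between privileged teacher and student.
    gaps = [abs(t - s) for s, t in zip(student_logps, teacher_logps)]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0

    weights = []
    for gap in gaps:
        # Relative magnitude; fall back to a uniform scale if all gaps are zero.
        scale = gap / mean_gap if mean_gap > 0 else 1.0
        # Clipping keeps a single token from dominating the update.
        scale = min(scale, clip)
        # The teacher never decides the sign, only how large the step is.
        weights.append(advantage * scale)
    return weights


# Toy example: only the middle token differs between teacher and student.
student = [-1.0, -2.0, -0.5]
teacher = [-1.0, -0.5, -0.5]
correct = rlsd_token_weights(student, teacher, reward=1.0, baseline=0.5)
wrong = rlsd_token_weights(student, teacher, reward=0.0, baseline=0.5)
```

In this sketch, a correct response yields non-negative weights and an incorrect one yields non-positive weights, whatever the teacher prefers, so the privileged teacher shapes only where the update concentrates and cannot flip its direction.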
