The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward

Abstract

A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is that multi-attempt performance (Pass@k) frequently degrades even as single-attempt accuracy (Pass@1) improves. This degradation is often accompanied by catastrophic forgetting, in which models lose previously acquired skills. Although various remedies have been proposed, the choice and function of the divergence term have been surprisingly overlooked as a proactive solution. We argue that standard RLVR objectives, both those that use the mode-seeking reverse KL-divergence and those that forgo a divergence term entirely, lack a crucial mechanism for knowledge retention. Reverse KL actively accelerates the decay by narrowing the policy, while omitting the divergence term provides no safeguard against the model drifting away from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), uses mass-covering f-divergences (such as forward KL and JS-divergence) as a rehearsal mechanism. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Extensive experiments on math and SQL generation demonstrate that DPH-RL not only resolves the Pass@k degradation but also improves both Pass@1 and Pass@k, in-domain and out-of-domain. DPH-RL is also more training-efficient: it computes the f-divergence using generator functions, which requires only samples from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR and demonstrates that properly selecting the divergence measure is a powerful tool for building more general and diverse reasoning models.
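
To make the mode-seeking versus mass-covering distinction concrete, the following is a minimal sketch and not the DPH-RL implementation. It assumes a toy setting with four discrete "solution styles" and uses the textbook f-divergence form, D_f = sum over y of ref(y) * f(policy(y) / ref(y)), so that every divergence is written as an expectation under the initial (reference) policy. The function names and the example distributions are hypothetical and chosen only for illustration.

```python
import math

def f_divergence(policy, ref, f):
    """D_f = sum_y ref(y) * f(policy(y) / ref(y)); assumes ref(y) > 0 for all y."""
    return sum(r * f(p / r) for p, r in zip(policy, ref))

def gen_forward_kl(t):
    # f(t) = -log t yields KL(ref || policy): mass-covering, because it grows
    # without bound as the policy abandons outcomes the reference supports.
    return math.inf if t == 0 else -math.log(t)

def gen_reverse_kl(t):
    # f(t) = t log t yields KL(policy || ref): mode-seeking.
    return t * math.log(t) if t > 0 else 0.0

def gen_js(t):
    # Generator of the Jensen-Shannon divergence (bounded above by log 2).
    if t == 0:
        return 0.5 * math.log(2.0)
    return 0.5 * (t * math.log(2 * t / (1 + t)) + math.log(2 / (1 + t)))

# Hypothetical toy distributions over four solution styles.
ref = [0.25, 0.25, 0.25, 0.25]           # diverse initial policy
mild = [0.97, 0.01, 0.01, 0.01]          # policy mostly collapsed onto one style
severe = [1 - 3e-6, 1e-6, 1e-6, 1e-6]    # policy almost fully collapsed

for name, policy in [("mild", mild), ("severe", severe)]:
    fwd = f_divergence(policy, ref, gen_forward_kl)
    rev = f_divergence(policy, ref, gen_reverse_kl)
    js = f_divergence(policy, ref, gen_js)
    print(f"{name:>6}: forward KL={fwd:.3f}  reverse KL={rev:.3f}  JS={js:.3f}")
```

In this toy case the reverse-KL penalty can never exceed log 4 (the reference is uniform over four outcomes), so it saturates as the policy collapses, whereas the forward-KL penalty keeps growing as the abandoned styles lose probability mass. That is the sense in which a mass-covering divergence can act as a safeguard for coverage. Because the sum is weighted by `ref`, a sample-based estimate of the same quantity would need samples only from the initial policy.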
