Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Abstract

Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods (confidence minimization, self-consistency, and majority-vote objectives) stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), which primarily adapt a model to the immediate unlabeled dataset at hand, our goal is broader: to enable general improvement without sacrificing the model's inherent exploration capacity and generalization ability, in other words, to let the model evolve. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation in a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) and adds a novelty-aware reward, measured in semantic space, that favors responses whose reasoning differs from what has already been produced (variation). Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; for example, training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. Beyond preventing diversity collapse, EVOL-RL unlocks stronger generalization across domains (e.g., GPQA). We further show that EVOL-RL also boosts performance in the RLVR setting, highlighting its broad applicability.
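
The abstract describes the reward only in words, so the following is a minimal sketch of how a majority-for-selection, novelty-for-variation reward could be computed for one group of sampled responses. Everything specific here is an assumption for illustration, not the authors' implementation: the function names, novelty defined as one minus the mean cosine similarity to the other responses, the reward bands [0.5, 1] and [-1, -0.5], and the clip widths `eps_low` and `eps_high`.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def novelty_scores(embeddings):
    """Assumed novelty: 1 minus the mean cosine similarity to the other responses."""
    n = len(embeddings)
    scores = []
    for i, e in enumerate(embeddings):
        others = [cosine(e, embeddings[j]) for j in range(n) if j != i]
        scores.append(1.0 - sum(others) / len(others) if others else 0.0)
    return scores


def evol_style_rewards(answers, embeddings, band=0.5):
    """Hypothetical rule: sign from the majority vote, magnitude from novelty.

    Responses matching the majority answer get a reward in [1 - band, 1];
    all others get a reward in [-1, -1 + band]. Within each group, more
    novel reasoning is rewarded more (or penalised less).
    """
    majority, _ = Counter(answers).most_common(1)[0]
    novelty = novelty_scores(embeddings)
    rewards = [0.0] * len(answers)
    for is_major in (True, False):
        idx = [i for i, a in enumerate(answers) if (a == majority) == is_major]
        if not idx:
            continue
        lo = min(novelty[i] for i in idx)
        hi = max(novelty[i] for i in idx)
        base = 1.0 - band if is_major else -1.0
        for i in idx:
            # Min-max normalise novelty inside the group; a flat group gets the midpoint.
            t = (novelty[i] - lo) / (hi - lo) if hi > lo else 0.5
            rewards[i] = base + band * t
    return majority, rewards


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO/GRPO-style clipped term with a wider upper clip range (assumed values)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Toy group: three responses agree on "42", one dissents with "7".
# The 2-D vectors stand in for embeddings of each response's reasoning.
answers = ["42", "42", "42", "7"]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority, rewards = evol_style_rewards(answers, embeddings)
print(majority, rewards)
```

In this toy group the third response reaches the majority answer through reasoning that is dissimilar to the first two, so it receives the largest reward, while the lone dissenting response stays negative. Keeping the two bands disjoint means novelty can reorder responses within a group but never outweigh the majority signal, which is one way to read "selection" and "variation" in the abstract.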
