Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
September 18, 2025
Authors: Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu
cs.AI
Abstract
Large language models (LLMs) are increasingly trained with reinforcement
learning from verifiable rewards (RLVR), yet real-world deployment demands
models that can self-improve without labels or external judges. Existing
label-free methods, such as confidence minimization, self-consistency, or majority-vote
objectives, stabilize learning but steadily shrink exploration, causing an
entropy collapse: generations become shorter, less diverse, and brittle. Unlike
prior approaches such as Test-Time Reinforcement Learning (TTRL), which
primarily adapt models to the immediate unlabeled dataset at hand, our goal is
broader: to enable general improvements without sacrificing the model's
inherent exploration capacity and generalization ability, i.e., evolving. We
formalize this issue and propose EVolution-Oriented and Label-free
Reinforcement Learning (EVOL-RL), a simple rule that couples stability with
variation under a label-free setting. EVOL-RL keeps the majority-voted answer
as a stable anchor (selection) while adding a novelty-aware reward that favors
responses whose reasoning differs from what has already been produced
(variation), measured in semantic space. Implemented with GRPO, EVOL-RL also
uses asymmetric clipping to preserve strong signals and an entropy regularizer
to sustain search. This majority-for-selection + novelty-for-variation design
prevents collapse, maintains longer and more informative chains of thought, and
improves both pass@1 and pass@n. EVOL-RL consistently outperforms the
majority-only TTRL baseline; e.g., training on label-free AIME24 lifts
Qwen3-4B-Base's AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5%
to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks
stronger generalization across domains (e.g., GPQA). Furthermore, we
demonstrate that EVOL-RL also boosts performance in the RLVR setting,
highlighting its broad applicability.
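To make the "majority-for-selection + novelty-for-variation" idea concrete, the sketch below shows one plausible way to compute label-free per-response rewards for a group of sampled responses. It is not the authors' implementation: the function name, the novelty_weight parameter, and the use of mean cosine distance over reasoning-trace embeddings are illustrative assumptions; the paper's exact reward shaping, asymmetric clipping, and entropy regularization live in the GRPO update and are only noted in comments.

```python
# Minimal sketch (assumed, not the paper's code) of a label-free group reward:
# majority-voted answer as the selection anchor, plus a semantic-novelty bonus.
from collections import Counter
import numpy as np

def evol_rl_rewards(answers, reasoning_embeddings, novelty_weight=0.5):
    """Per-response rewards for one sampled group, without any labels.

    answers: list[str], final answers parsed from each sampled response.
    reasoning_embeddings: (n, d) array embedding each reasoning trace with
        any off-the-shelf sentence encoder (an assumption of this sketch).
    """
    # Selection: the majority-voted answer serves as a pseudo-label anchor.
    majority_answer, _ = Counter(answers).most_common(1)[0]
    selection = np.array([1.0 if a == majority_answer else 0.0 for a in answers])

    # Variation: reward reasoning that differs, in semantic space, from the
    # other responses in the same group (here: 1 - mean cosine similarity).
    emb = np.asarray(reasoning_embeddings, dtype=float)
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
    sim = emb @ emb.T                              # pairwise cosine similarities
    n = len(answers)
    mean_sim_to_others = (sim.sum(axis=1) - 1.0) / max(n - 1, 1)
    novelty = 1.0 - mean_sim_to_others             # higher = more novel reasoning

    # These rewards would then be turned into group-relative advantages and
    # optimized with GRPO, using asymmetric clipping and an entropy
    # regularizer as described in the abstract.
    return selection + novelty_weight * novelty
```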