Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
September 18, 2025
Authors: Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu
cs.AI
Abstract
Large language models (LLMs) are increasingly trained with reinforcement
learning from verifiable rewards (RLVR), yet real-world deployment demands
models that can self-improve without labels or external judges. Existing
label-free methods (confidence minimization, self-consistency, or majority-vote
objectives) stabilize learning but steadily shrink exploration, causing an
entropy collapse: generations become shorter, less diverse, and brittle. Unlike
prior approaches such as Test-Time Reinforcement Learning (TTRL), which
primarily adapt models to the immediate unlabeled dataset at hand, our goal is
broader: to enable general improvements without sacrificing the model's
inherent exploration capacity and generalization ability, i.e., evolving. We
formalize this issue and propose EVolution-Oriented and Label-free
Reinforcement Learning (EVOL-RL), a simple rule that couples stability with
variation under a label-free setting. EVOL-RL keeps the majority-voted answer
as a stable anchor (selection) while adding a novelty-aware reward that favors
responses whose reasoning differs from what has already been produced
(variation), measured in semantic space. Implemented with GRPO, EVOL-RL also
uses asymmetric clipping to preserve strong signals and an entropy regularizer
to sustain search. This majority-for-selection + novelty-for-variation design
prevents collapse, maintains longer and more informative chains of thought, and
improves both pass@1 and pass@n. EVOL-RL consistently outperforms the
majority-only TTRL baseline; e.g., training on label-free AIME24 lifts
Qwen3-4B-Base's AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5%
to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks
stronger generalization across domains (e.g., GPQA). Furthermore, we
demonstrate that EVOL-RL also boosts performance in the RLVR setting,
highlighting its broad applicability.
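
The "majority-for-selection + novelty-for-variation" reward described in the abstract can be made concrete with a small sketch. The snippet below is an illustrative reconstruction, not the authors' implementation: the helper names (majority_anchor, novelty_scores, group_rewards), the use of cosine similarity over reasoning embeddings, and the novelty_weight mixing coefficient are assumptions introduced here for clarity.

# Minimal sketch of the reward shaping described in the abstract, assuming
# cosine similarity over per-response reasoning embeddings. Not the authors'
# code; names and the weighting scheme are illustrative.
from collections import Counter
import numpy as np

def majority_anchor(answers):
    """Return the majority-voted final answer of a sampled group (selection)."""
    return Counter(answers).most_common(1)[0][0]

def novelty_scores(reasoning_embeddings):
    """Score each response by how dissimilar its reasoning is from the rest
    of the group in semantic (embedding) space (variation)."""
    E = np.asarray(reasoning_embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize rows
    sim = E @ E.T                                      # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)                         # ignore self-similarity
    mean_sim = sim.sum(axis=1) / (len(E) - 1)          # average similarity to the others
    return 1.0 - mean_sim                              # higher = more novel reasoning

def group_rewards(answers, reasoning_embeddings, novelty_weight=0.5):
    """Combine the majority anchor with a novelty bonus for each sampled response."""
    anchor = majority_anchor(answers)
    agree = np.array([1.0 if a == anchor else 0.0 for a in answers])
    novelty = novelty_scores(reasoning_embeddings)
    # Responses that reach the consensus answer via dissimilar reasoning score highest.
    return agree + novelty_weight * novelty

In a GRPO-style training loop, per-response rewards of this form would be normalized within each sampled group to produce advantages, with the asymmetric clipping and entropy regularization mentioned in the abstract applied during the policy update.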