Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Abstract

Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods (confidence minimization, self-consistency, or majority-vote objectives) stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Prior approaches such as Test-Time Reinforcement Learning (TTRL) primarily adapt a model to the immediate unlabeled dataset at hand. Our goal is broader: to enable general improvement without sacrificing the model's inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation in a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) while adding a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation), with the difference measured in semantic space. Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we demonstrate that EVOL-RL also boosts performance in the RLVR setting, highlighting its broad applicability.
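
The selection-plus-variation idea can be illustrated with a small sketch. The abstract does not give the exact reward formula, so everything below is an assumption made for illustration: the function names, the +1/-1 base reward, the `novelty_weight` parameter, and the use of mean cosine similarity over precomputed response embeddings as the semantic-space measure. It is not the authors' implementation, and it omits GRPO, asymmetric clipping, and the entropy regularizer.

```python
from collections import Counter

import numpy as np


def majority_answer(answers):
    """Return the most common final answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def novelty_scores(embeddings):
    """Novelty of each response: 1 - mean cosine similarity to the others.

    Assumes at least two responses and non-zero embedding vectors.
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    sim = x @ x.T                                     # pairwise cosine similarity
    n = len(x)
    mean_other = (sim.sum(axis=1) - np.diag(sim)) / (n - 1)  # exclude self
    return 1.0 - mean_other


def evol_style_rewards(answers, embeddings, novelty_weight=0.5):
    """Hypothetical reward: majority agreement (selection) plus a novelty bonus (variation)."""
    target = majority_answer(answers)
    base = np.array([1.0 if a == target else -1.0 for a in answers])
    return base + novelty_weight * novelty_scores(embeddings)


# Three sampled responses with the same answer; the third reasons differently.
rewards = evol_style_rewards(
    ["42", "42", "42"],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
)
print(rewards)
```

In this toy case all three responses agree with the majority, but the third has an embedding orthogonal to the other two, so it receives the largest reward. A majority-only reward would score all three identically, which is the behavior the abstract associates with shrinking exploration.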
