Reinforcement Learning from Meta-Evaluation: Aligning Language Models Without Ground-Truth Labels
January 29, 2026
Authors: Micah Rentschler, Jesse Roberts
cs.AI
Abstract
Most reinforcement learning (RL) methods for training large language models (LLMs) require ground-truth labels or task-specific verifiers, limiting scalability when correctness is ambiguous or expensive to obtain. We introduce Reinforcement Learning from Meta-Evaluation (RLME), which optimizes a generator using a reward derived from an evaluator's answers to natural-language meta-questions (e.g., "Is the answer correct?" or "Is the reasoning logically consistent?"). RLME treats the evaluator's probability of a positive judgment as the reward and updates the generator via group-relative policy optimization, enabling learning without labels. Across a suite of experiments, we show that RLME achieves accuracy and sample efficiency comparable to label-based training, enables controllable trade-offs among multiple objectives, steers models toward reliable reasoning patterns rather than post-hoc rationalization, and generalizes to open-domain settings where ground-truth labels are unavailable, broadening the domains in which LLMs may be trained with RL.
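To make the reward construction concrete, the sketch below shows one way the abstract's recipe could be wired up: each sampled completion is scored by an evaluator's probability of answering "Yes" to a meta-question, and rewards are standardized within a group of samples for the same prompt, in the style of group-relative policy optimization. This is a minimal illustration under stated assumptions, not the authors' implementation; the helper `evaluator_yes_probability`, the meta-question strings, and the per-question weights are all hypothetical.

```python
# Minimal sketch of an RLME-style reward signal, assuming a hypothetical helper
# `evaluator_yes_probability(text) -> float` that returns the evaluator's
# probability of answering "Yes" to the appended meta-question.
from statistics import mean, stdev
from typing import Callable, List, Sequence

# Illustrative meta-questions; the paper's exact prompts are not specified here.
META_QUESTIONS = [
    "Is the answer correct?",
    "Is the reasoning logically consistent?",
]

def rlme_rewards(
    prompt: str,
    completions: List[str],
    evaluator_yes_probability: Callable[[str], float],
    weights: Sequence[float] = (0.5, 0.5),
) -> List[float]:
    """Score each completion by the evaluator's probability of a positive
    judgment on each meta-question; a weighted sum is one way to realize the
    multi-objective trade-off mentioned in the abstract."""
    rewards = []
    for completion in completions:
        score = 0.0
        for weight, question in zip(weights, META_QUESTIONS):
            query = f"{prompt}\n\nResponse:\n{completion}\n\n{question}"
            score += weight * evaluator_yes_probability(query)  # P("Yes") in [0, 1]
        rewards.append(score)
    return rewards

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's group of samples (GRPO-style
    baseline); these advantages would then drive the policy-gradient update."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # Stub evaluator for demonstration only: pretends longer responses are better.
    fake_evaluator = lambda text: min(1.0, len(text) / 400.0)
    samples = ["Short answer.", "A somewhat longer, step-by-step answer. " * 4]
    rewards = rlme_rewards("What is 2 + 2?", samples, fake_evaluator)
    print(group_relative_advantages(rewards))
```

In a full training loop, the group-relative advantages above would replace label-derived rewards in the policy update, which is what lets the generator improve without ground-truth answers.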