References Improve LLM Alignment in Non-Verifiable Domains

February 18, 2026
Authors: Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty, Arman Cohan
cs.AI

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.
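To make the described pipeline concrete, below is a minimal sketch of how a reference-guided LLM judge could be used to build preference pairs for self-improvement. This is not the paper's implementation: the OpenAI-compatible client, judge model name, prompt wording, pairwise comparison protocol, and DPO-style pair selection are all illustrative assumptions.

```python
# Minimal sketch of reference-guided judging for self-improvement.
# Assumptions (not from the paper): the OpenAI client, the judge model name,
# the prompt wording, and the pairwise selection scheme are placeholders.
from openai import OpenAI

client = OpenAI()
JUDGE_MODEL = "gpt-4o"  # hypothetical judge; the paper studies judges of varying strength

JUDGE_PROMPT = """You are evaluating two candidate responses to a user instruction.
A high-quality reference answer is provided to guide your judgment; the winning
response need not copy the reference, but should match or exceed its quality.

Instruction:
{instruction}

Reference answer:
{reference}

Response A:
{response_a}

Response B:
{response_b}

Reply with exactly "A" or "B" to indicate the better response."""


def reference_guided_judge(instruction: str, reference: str,
                           response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' according to the reference-guided LLM judge."""
    prompt = JUDGE_PROMPT.format(instruction=instruction, reference=reference,
                                 response_a=response_a, response_b=response_b)
    reply = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    verdict = reply.choices[0].message.content.strip()
    return "A" if verdict.startswith("A") else "B"


def select_preference_pair(instruction: str, reference: str, candidates: list[str]):
    """Rank sampled candidates with the judge and return (chosen, rejected),
    e.g. for DPO-style preference tuning in a self-improvement loop."""
    best, worst = candidates[0], candidates[0]
    for cand in candidates[1:]:
        # Candidate wins against current best -> it becomes the new best.
        if reference_guided_judge(instruction, reference, best, cand) == "B":
            best = cand
        # Current worst wins against candidate -> candidate becomes the new worst.
        if reference_guided_judge(instruction, reference, worst, cand) == "A":
            worst = cand
    return best, worst
```

In this sketch, the reference answer acts as the soft "verifier": the judge compares sampled responses against it rather than scoring them in isolation, and the resulting (chosen, rejected) pairs would then feed a preference-optimization step on the policy model.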