

References Improve LLM Alignment in Non-Verifiable Domains

February 18, 2026
Authors: Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty, Arman Cohan
cs.AI

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.
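To make the "reference-guided LLM-as-judge" idea concrete, here is a minimal Python sketch of a pairwise judge that is conditioned on a reference output. The prompt wording, the A/B verdict format, and the `generate` callable are illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch of a reference-guided pairwise judge (illustrative; not the
# paper's exact prompt or protocol).

JUDGE_TEMPLATE = """You are evaluating two responses to the same instruction.
A high-quality reference answer is provided; use it as a guide for what a good
response looks like, but judge the candidates on their own merits.

Instruction:
{instruction}

Reference answer:
{reference}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Reply with exactly "A" or "B"."""


def reference_guided_judge(generate, instruction, response_a, response_b, reference):
    """Return 'A' or 'B' from an LLM judge conditioned on a reference output.

    `generate` is any callable mapping a prompt string to the judge model's
    text completion (e.g. a local model or an API client wrapper).
    """
    prompt = JUDGE_TEMPLATE.format(
        instruction=instruction,
        reference=reference,
        response_a=response_a,
        response_b=response_b,
    )
    verdict = generate(prompt).strip().upper()
    # Fall back to 'A' on malformed output; a more careful judge would retry
    # and swap the response order to control for position bias.
    return verdict if verdict in {"A", "B"} else "A"
```

Preference labels collected this way over a model's own sampled responses can then drive preference-based self-improvement (e.g., iterative preference optimization), which is the role the abstract assigns to reference-guided judges in alignment tuning.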