BLEUBERI: BLEU is a surprisingly effective reward for instruction following
May 16, 2025
Authors: Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer
cs.AI
Abstract
Reward models are central to aligning LLMs with human preferences, but they
are costly to train, requiring large-scale human-labeled preference data and
powerful pretrained LLM backbones. Meanwhile, the increasing availability of
high-quality synthetic instruction-following datasets raises the question: can
simpler, reference-based metrics serve as viable alternatives to reward models
during RL-based alignment? In this paper, we show first that BLEU, a basic
string-matching metric, surprisingly matches strong reward models in agreement
with human preferences on general instruction-following datasets. Based on this
insight, we develop BLEUBERI, a method that first identifies challenging
instructions and then applies Group Relative Policy Optimization (GRPO) using
BLEU directly as the reward function. We demonstrate that BLEUBERI-trained
models are competitive with models trained via reward model-guided RL across
four challenging instruction-following benchmarks and three different base
language models. A human evaluation further supports that the quality of
BLEUBERI model outputs is on par with those from reward model-aligned models.
Moreover, BLEUBERI models generate outputs that are more factually grounded
than competing methods. Overall, we show that given access to high-quality
reference outputs (easily obtained via existing instruction-following datasets
or synthetic data generation), string matching-based metrics are cheap yet
effective proxies for reward models during alignment. We release our code and
data at https://github.com/lilakk/BLEUBERI.
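As a minimal sketch of the idea described in the abstract (using BLEU against a reference output as the per-sample reward during GRPO), the following illustrative Python snippet assumes the `sacrebleu` package and a hypothetical `bleu_reward` function; the signature loosely mirrors the reward-function interface of common GRPO trainers and is not the authors' released implementation.

```python
# Illustrative sketch only: BLEU-as-reward for GRPO-style RL alignment.
# Assumes the `sacrebleu` package; `bleu_reward` and its signature are hypothetical.
from typing import List
import sacrebleu

def bleu_reward(completions: List[str], references: List[str]) -> List[float]:
    """Score each sampled completion against its reference with sentence-level BLEU."""
    rewards = []
    for completion, reference in zip(completions, references):
        # sacrebleu reports BLEU on a 0-100 scale; rescale to [0, 1] for use as a reward.
        score = sacrebleu.sentence_bleu(completion, [reference]).score
        rewards.append(score / 100.0)
    return rewards

# Example: two sampled completions for the same instruction; the one closer to the
# reference receives the higher reward, which is the signal GRPO would optimize.
refs = ["Paris is the capital of France."] * 2
outs = ["Paris is the capital of France.", "The capital is Berlin."]
print(bleu_reward(outs, refs))
```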