BLEUBERI: BLEU is a surprisingly effective reward for instruction following
May 16, 2025
Authors: Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer
cs.AI
Abstract
Reward models are central to aligning LLMs with human preferences, but they
are costly to train, requiring large-scale human-labeled preference data and
powerful pretrained LLM backbones. Meanwhile, the increasing availability of
high-quality synthetic instruction-following datasets raises the question: can
simpler, reference-based metrics serve as viable alternatives to reward models
during RL-based alignment? In this paper, we show first that BLEU, a basic
string-matching metric, surprisingly matches strong reward models in agreement
with human preferences on general instruction-following datasets. Based on this
insight, we develop BLEUBERI, a method that first identifies challenging
instructions and then applies Group Relative Policy Optimization (GRPO) using
BLEU directly as the reward function. We demonstrate that BLEUBERI-trained
models are competitive with models trained via reward model-guided RL across
four challenging instruction-following benchmarks and three different base
language models. A human evaluation further supports that the quality of
BLEUBERI model outputs is on par with those from reward model-aligned models.
Moreover, BLEUBERI models generate outputs that are more factually grounded
than competing methods. Overall, we show that given access to high-quality
reference outputs (easily obtained via existing instruction-following datasets
or synthetic data generation), string matching-based metrics are cheap yet
effective proxies for reward models during alignment. We release our code and
data at https://github.com/lilakk/BLEUBERI.
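To make the core idea concrete, the sketch below shows how sentence-level BLEU against a reference output could serve as a per-sample reward, followed by the group-relative normalization used in GRPO. This is a minimal illustration only: the function names (`bleu_rewards`, `group_relative_advantages`), the use of the `sacrebleu` package, and the toy prompt/reference are assumptions for exposition, not the released BLEUBERI implementation (see the repository above for that).

```python
# Illustrative sketch: BLEU as a reward signal for GRPO-style training.
# Assumes `sacrebleu` is installed (pip install sacrebleu).
from typing import List

import sacrebleu


def bleu_rewards(completions: List[str], reference: str) -> List[float]:
    """Score each sampled completion against a single reference with sentence BLEU.

    sacrebleu returns scores in [0, 100]; rescale to [0, 1] for use as a reward.
    """
    return [
        sacrebleu.sentence_bleu(completion, [reference]).score / 100.0
        for completion in completions
    ]


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style normalization: center and scale rewards within the sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical example: one reference answer, a group of sampled completions.
    reference = "The capital of France is Paris."
    samples = [
        "Paris is the capital of France.",
        "The capital of France is Paris.",
        "I am not sure about that.",
    ]
    rewards = bleu_rewards(samples, reference)
    print("BLEU rewards:", rewards)
    print("Group-relative advantages:", group_relative_advantages(rewards))
```

In an actual training loop, a reward function like `bleu_rewards` would be called on each group of rollouts for a prompt, and the resulting group-relative advantages would weight the policy-gradient update, replacing the learned reward model that GRPO pipelines typically rely on.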