BLEUBERI: BLEU is a surprisingly effective reward for instruction following

Abstract

Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at https://github.com/lilakk/BLEUBERI.
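
As a rough illustration of the idea described above, the sketch below scores each sampled response against a reference with a simplified sentence-level BLEU and then normalizes the scores within the sampled group, in the style of GRPO advantages. It is not the authors' implementation (the released repository is the authoritative source). The function names `bleu` and `group_relative_advantages`, the whitespace tokenization, and the add-one smoothing for higher-order n-grams are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU in [0, 1] (illustrative, not sacreBLEU).

    Assumptions: whitespace tokenization, a single reference, and add-one
    smoothing for n-gram orders above 1.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(count, r[gram]) for gram, count in c.items())
        total = sum(c.values())
        if n == 1:
            if overlap == 0:
                return 0.0  # no shared words at all
            precision = overlap / total
        else:
            precision = (overlap + 1) / (total + 1)
        log_precision += math.log(precision) / max_n
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples (mean 0, unit scale)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((x - mean) ** 2 for x in rewards) / len(rewards))
    return [(x - mean) / (std + eps) for x in rewards]
```

In use, one would sample several responses to the same instruction, call `bleu(response, reference)` on each, and pass the resulting list to `group_relative_advantages`; responses that overlap more with the reference than their siblings receive positive advantages. No learned reward model is involved at any point, which is the cost saving the abstract points to.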
