RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Abstract

Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost).
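
To make the entailment framing concrete, the following is a minimal sketch of how binary principle judgments could be represented and scored. Everything here is an illustrative assumption rather than the authors' implementation: the `PrincipleExample` schema, the cross-entropy objective, the averaging rule in `score_response`, and the stand-in probabilities are all hypothetical, and a real reward model would produce the probabilities instead of a hard-coded dictionary.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PrincipleExample:
    """One entailment-style training example (hypothetical schema)."""
    prompt: str
    response: str
    principle: str   # e.g. "accuracy of information"
    satisfied: bool  # binary label extracted from natural language feedback


def binary_cross_entropy(p_yes: float, satisfied: bool) -> float:
    """Loss for a predicted probability that the response satisfies the principle.

    Assumption: a standard binary cross-entropy objective; the abstract does
    not state which loss is used.
    """
    eps = 1e-12
    p = min(max(p_yes, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if satisfied else -math.log(1.0 - p)


def score_response(p_yes_by_principle: dict[str, float], principles: list[str]) -> float:
    """Average predicted satisfaction over the principles the user selects.

    Averaging is an illustrative choice, not something the abstract specifies.
    """
    if not principles:
        raise ValueError("at least one principle is required")
    return sum(p_yes_by_principle[p] for p in principles) / len(principles)


# Two examples mirroring the abstract's "accuracy of information: yes"
# and "code readability: no".
examples = [
    PrincipleExample("Explain quicksort.", "Quicksort is ...", "accuracy of information", True),
    PrincipleExample("Write a sort function.", "def s(a): ...", "code readability", False),
]

# Stand-in predictions; a trained reward model would supply these values.
predicted = {"accuracy of information": 0.9, "code readability": 0.2}

total_loss = sum(binary_cross_entropy(predicted[ex.principle], ex.satisfied) for ex in examples)
focused_score = score_response(predicted, ["code readability"])
```

Restricting the `principles` argument is one way to picture the inference-time customization described above: the same predictions yield a different score depending on which principles the user chooses to focus on.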

