

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

September 25, 2025
Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev
cs.AI

Abstract

Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost).
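The abstract describes querying the reward model as an entailment task: given a prompt, a response, and a binary principle (e.g. "accuracy of information"), the model judges whether the response satisfies the principle, and users can choose which principles to score at inference time. The sketch below is a minimal illustration of that inference-time interface, not the authors' implementation; the names `score_response`, `PrincipleJudgment`, `entailment_judge`, and `toy_judge` are hypothetical stand-ins introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class PrincipleJudgment:
    principle: str   # e.g. "accuracy of information", "code readability"
    satisfied: bool  # binary entailment verdict for this principle


def score_response(
    prompt: str,
    response: str,
    principles: Sequence[str],
    entailment_judge: Callable[[str, str, str], bool],
) -> float:
    """Score a response as the fraction of user-chosen principles it satisfies.

    `entailment_judge(prompt, response, principle)` stands in for the trained
    RLBFF reward model queried as an entailment task: does the response
    satisfy the principle (True) or not (False)?
    """
    judgments = [
        PrincipleJudgment(p, entailment_judge(prompt, response, p))
        for p in principles
    ]
    if not judgments:
        return 0.0
    return sum(j.satisfied for j in judgments) / len(judgments)


# Crude keyword heuristic used only so the example runs end to end;
# a real setup would call the trained reward model instead.
def toy_judge(prompt: str, response: str, principle: str) -> bool:
    stem = principle.lower().split()[0][:5]
    return stem in response.lower()


if __name__ == "__main__":
    reward = score_response(
        prompt="Explain binary search.",
        response="Binary search halves the search range each step, "
                 "which keeps it accurate and easy to follow.",
        principles=["accuracy of information", "code readability"],
        entailment_judge=toy_judge,
    )
    print(f"reward = {reward:.2f}")
```

Because each principle is scored independently as a yes/no entailment question, swapping in a different set of principles at inference time changes what the reward emphasizes without retraining, which is the flexibility the abstract contrasts with Bradley-Terry models.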