IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards
August 6, 2025
Authors: Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, Kai Chen
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction
following capabilities of large language models (LLMs), but suffers from
training inefficiency due to inadequate difficulty assessment. Moreover, RLVR
is prone to over-optimization, where LLMs exploit verification shortcuts
without aligning to the actual intent of user instructions. We introduce
Instruction Following Decorator (IFDecorator), a framework that wraps RLVR
training into a robust and sample-efficient pipeline. It consists of three
components: (1) a cooperative-adversarial data flywheel that co-evolves
instructions and hybrid verifications, generating progressively more
challenging instruction-verification pairs; (2) IntentCheck, a bypass module
enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that
detects reward hacking via trap instructions, which trigger and capture
shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves
87.43% accuracy on IFEval, outperforming larger proprietary models such as
GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench
while preserving general capabilities. Our trip wires show significant
reductions in reward hacking rates. We will release models, code, and data for
future research.
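
To make the three components concrete, below is a minimal sketch of how a verifiable reward for instruction following might combine rule-based constraint checks, an LLM-based intent check (in the spirit of IntentCheck), and trap-instruction trip wires. This is not the authors' released code: the function names, the trap-detection thresholds, and the weighting scheme are illustrative assumptions only.

```python
# Hypothetical sketch of a verifiable-reward pipeline with an intent check
# and trip-wire trap instructions. Names and weights are assumptions, not
# the IFDecorator implementation.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Sample:
    instruction: str                      # possibly a trap instruction with a known shortcut
    response: str                         # model rollout to be scored
    rule_checks: List[Callable[[str], bool]] = field(default_factory=list)  # verifiable constraints
    is_trap: bool = False                 # trip-wire sample used to detect reward hacking


def rule_score(sample: Sample) -> float:
    """Fraction of verifiable constraints satisfied by the response."""
    if not sample.rule_checks:
        return 1.0
    return sum(check(sample.response) for check in sample.rule_checks) / len(sample.rule_checks)


def intent_check(instruction: str, response: str) -> float:
    """Placeholder for an LLM judge that scores (0..1) whether the response
    serves the user's actual intent, beyond surface constraint satisfaction."""
    raise NotImplementedError("call an LLM judge here")


def reward(sample: Sample, w_rule: float = 0.7, w_intent: float = 0.3) -> float:
    """Combine rule-based and intent-based signals; zero out trap samples
    that pass the rules while ignoring intent (likely shortcut exploitation)."""
    r_rule = rule_score(sample)
    r_intent = intent_check(sample.instruction, sample.response)
    if sample.is_trap and r_rule > 0.9 and r_intent < 0.5:
        return 0.0  # trip wire fired: constraints met, intent ignored
    return w_rule * r_rule + w_intent * r_intent
```

In such a setup, the fraction of trap samples for which the trip wire fires could serve as a running estimate of the reward-hacking rate during training; the abstract reports that the paper's trip-wire mechanism significantly reduces this rate.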