IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards
August 6, 2025
Authors: Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, Kai Chen
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction
following capabilities of large language models (LLMs), but suffers from
training inefficiency due to inadequate difficulty assessment. Moreover, RLVR
is prone to over-optimization, where LLMs exploit verification shortcuts
without aligning to the actual intent of user instructions. We introduce
Instruction Following Decorator (IFDecorator), a framework that wraps RLVR
training into a robust and sample-efficient pipeline. It consists of three
components: (1) a cooperative-adversarial data flywheel that co-evolves
instructions and hybrid verifications, generating progressively more
challenging instruction-verification pairs; (2) IntentCheck, a bypass module
enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that
detects reward hacking via trap instructions, which trigger and capture
shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves
87.43% accuracy on IFEval, outperforming larger proprietary models such as
GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench
while preserving general capabilities. Our trip wires show significant
reductions in reward hacking rates. We will release models, code, and data for
future research.
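
To make the three components concrete, below is a minimal sketch of how a verifiable reward for instruction following might combine rule-based constraint checks, an LLM-based intent check (in the spirit of IntentCheck), and trap-instruction trip wires. This is not the authors' released code: the function names, the trap-detection thresholds, and the weighting scheme are illustrative assumptions only.

```python
# Hypothetical sketch of a verifiable-reward pipeline with an intent check
# and trip-wire trap instructions. Names and weights are assumptions, not
# the IFDecorator implementation.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Sample:
    instruction: str                      # possibly a trap instruction with a known shortcut
    response: str                         # model rollout to be scored
    rule_checks: List[Callable[[str], bool]] = field(default_factory=list)  # verifiable constraints
    is_trap: bool = False                 # trip-wire sample used to detect reward hacking


def rule_score(sample: Sample) -> float:
    """Fraction of verifiable constraints satisfied by the response."""
    if not sample.rule_checks:
        return 1.0
    return sum(check(sample.response) for check in sample.rule_checks) / len(sample.rule_checks)


def intent_check(instruction: str, response: str) -> float:
    """Placeholder for an LLM judge that scores (0..1) whether the response
    serves the user's actual intent, beyond surface constraint satisfaction."""
    raise NotImplementedError("call an LLM judge here")


def reward(sample: Sample, w_rule: float = 0.7, w_intent: float = 0.3) -> float:
    """Combine rule-based and intent-based signals; zero out trap samples
    that pass the rules while ignoring intent (likely shortcut exploitation)."""
    r_rule = rule_score(sample)
    r_intent = intent_check(sample.instruction, sample.response)
    if sample.is_trap and r_rule > 0.9 and r_intent < 0.5:
        return 0.0  # trip wire fired: constraints met, intent ignored
    return w_rule * r_rule + w_intent * r_intent
```

In such a setup, the fraction of trap samples for which the trip wire fires could serve as a running estimate of the reward-hacking rate during training; the abstract reports that the paper's trip-wire mechanism significantly reduces this rate.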