IFDECORATOR：以可驗證獎勵機制包裝的指令跟隨強化學習

摘要

基於可驗證獎勵的強化學習（RLVR）提升了大型語言模型（LLMs）的指令遵循能力，但由於難度評估不足，存在訓練效率低下的問題。此外，RLVR容易出現過度優化，即LLMs利用驗證捷徑而不對齊用戶指令的實際意圖。我們引入了指令遵循裝飾器（IFDecorator），這是一個將RLVR訓練封裝成穩健且樣本高效管線的框架。它包含三個組件：（1）一個合作對抗的數據飛輪，共同演化指令和混合驗證，生成逐步更具挑戰性的指令-驗證對；（2）IntentCheck，一個強制意圖對齊的旁路模塊；以及（3）觸發線，一種通過陷阱指令檢測獎勵黑客行為的診斷機制，這些陷阱指令觸發並捕捉捷徑利用行為。我們的Qwen2.5-32B-Instruct-IFDecorator在IFEval上達到了87.43%的準確率，超越了如GPT-4o等更大的專有模型。此外，我們在FollowBench上展示了顯著的改進，同時保持了通用能力。我們的觸發線顯示獎勵黑客率顯著降低。我們將發布模型、代碼和數據以供未來研究。

English

Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator}, a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.

IFDECORATOR：以可驗證獎勵機制包裝的指令跟隨強化學習

IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

摘要

Support