VerIF: Verification Engineering for Reinforcement Learning in Instruction Following
June 11, 2025
Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a key
technique for enhancing large language models (LLMs), with verification
engineering playing a central role. However, best practices for RL in
instruction following remain underexplored. In this work, we explore the
verification challenge in RL for instruction following and propose VerIF, a
verification method that combines rule-based code verification with LLM-based
verification from a large reasoning model (e.g., QwQ-32B). To support this
approach, we construct a high-quality instruction-following dataset,
VerInstruct, containing approximately 22,000 instances with associated
verification signals. We apply RL training with VerIF to two models, achieving
significant improvements across several representative instruction-following
benchmarks. The trained models reach state-of-the-art performance among models
of comparable size and generalize well to unseen constraints. We further
observe that their general capabilities remain unaffected, suggesting that RL
with VerIF can be integrated into existing RL recipes to enhance overall model
performance. We have released our datasets, code, and models to facilitate
future research at https://github.com/THU-KEG/VerIF.
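The abstract describes VerIF as a combination of rule-based code verification for hard constraints and LLM-based verification (from a large reasoning model such as QwQ-32B) for soft constraints. Below is a minimal sketch of such a hybrid verifier; the function names, judging prompt, and the binary AND-style reward aggregation are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative hybrid verifier in the spirit of VerIF:
# rule-based (code) checks for hard constraints combined with an
# LLM judge for soft constraints. Names, the judging prompt, and
# the reward aggregation are assumptions, not the paper's recipe.
from typing import Callable, List


def check_max_words(response: str, limit: int = 100) -> bool:
    """Hard constraint: response must not exceed `limit` words."""
    return len(response.split()) <= limit


def check_keyword(response: str, keyword: str = "Python") -> bool:
    """Hard constraint: response must mention the keyword."""
    return keyword.lower() in response.lower()


def llm_judge(instruction: str, response: str, constraint: str,
              ask_llm: Callable[[str], str]) -> bool:
    """Soft constraint: ask a reasoning model (e.g., QwQ-32B served
    behind the caller-supplied `ask_llm`) whether the response
    satisfies the constraint."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        f"Does the response satisfy this constraint: {constraint}?\n"
        "Answer YES or NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")


def verify(instruction: str, response: str,
           rule_checks: List[Callable[[str], bool]],
           soft_constraints: List[str],
           ask_llm: Callable[[str], str]) -> float:
    """Return a binary reward: 1.0 only if every hard (rule-based)
    and soft (LLM-judged) check passes, else 0.0."""
    hard_ok = all(check(response) for check in rule_checks)
    soft_ok = all(llm_judge(instruction, response, c, ask_llm)
                  for c in soft_constraints)
    return 1.0 if (hard_ok and soft_ok) else 0.0
```

Such a verifier could serve as the reward function in an RLVR loop: each sampled response is scored by `verify`, and the resulting reward drives the policy update. The all-or-nothing aggregation shown here is only one plausible choice; weighted or per-constraint rewards are equally compatible with this structure.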