VerIF: Verification Engineering for Reinforcement Learning in Instruction Following
June 11, 2025
Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a key
technique for enhancing large language models (LLMs), with verification
engineering playing a central role. However, best practices for RL in
instruction following remain underexplored. In this work, we explore the
verification challenge in RL for instruction following and propose VerIF, a
verification method that combines rule-based code verification with LLM-based
verification from a large reasoning model (e.g., QwQ-32B). To support this
approach, we construct a high-quality instruction-following dataset,
VerInstruct, containing approximately 22,000 instances with associated
verification signals. We apply RL training with VerIF to two models, achieving
significant improvements across several representative instruction-following
benchmarks. The trained models reach state-of-the-art performance among models
of comparable size and generalize well to unseen constraints. We further
observe that their general capabilities remain unaffected, suggesting that RL
with VerIF can be integrated into existing RL recipes to enhance overall model
performance. We have released our datasets, code, and models to facilitate
future research at https://github.com/THU-KEG/VerIF.
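The abstract describes VerIF as a combination of rule-based code verification for hard constraints and LLM-based verification (from a large reasoning model such as QwQ-32B) for soft constraints. Below is a minimal sketch of such a hybrid verifier; the function names, judging prompt, and the binary AND-style reward aggregation are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative hybrid verifier in the spirit of VerIF:
# rule-based (code) checks for hard constraints combined with an
# LLM judge for soft constraints. Names, the judging prompt, and
# the reward aggregation are assumptions, not the paper's recipe.
from typing import Callable, List


def check_max_words(response: str, limit: int = 100) -> bool:
    """Hard constraint: response must not exceed `limit` words."""
    return len(response.split()) <= limit


def check_keyword(response: str, keyword: str = "Python") -> bool:
    """Hard constraint: response must mention the keyword."""
    return keyword.lower() in response.lower()


def llm_judge(instruction: str, response: str, constraint: str,
              ask_llm: Callable[[str], str]) -> bool:
    """Soft constraint: ask a reasoning model (e.g., QwQ-32B served
    behind the caller-supplied `ask_llm`) whether the response
    satisfies the constraint."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        f"Does the response satisfy this constraint: {constraint}?\n"
        "Answer YES or NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")


def verify(instruction: str, response: str,
           rule_checks: List[Callable[[str], bool]],
           soft_constraints: List[str],
           ask_llm: Callable[[str], str]) -> float:
    """Return a binary reward: 1.0 only if every hard (rule-based)
    and soft (LLM-judged) check passes, else 0.0."""
    hard_ok = all(check(response) for check in rule_checks)
    soft_ok = all(llm_judge(instruction, response, c, ask_llm)
                  for c in soft_constraints)
    return 1.0 if (hard_ok and soft_ok) else 0.0
```

Such a verifier could serve as the reward function in an RLVR loop: each sampled response is scored by `verify`, and the resulting reward drives the policy update. The all-or-nothing aggregation shown here is only one plausible choice; weighted or per-constraint rewards are equally compatible with this structure.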