Reinforcing General Reasoning without Verifiers

May 27, 2025
Authors: Xiangxin Zhou, Zichen Liu, Anya Sims, Haonan Wang, Tianyu Pang, Chongxuan Li, Liang Wang, Min Lin, Chao Du
cs.AI

Abstract

The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
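To make the abstract's core idea concrete: rather than scoring a sampled answer with a separate verifier, VeriFree trains the policy to maximize the probability it assigns to the reference answer y* after a sampled reasoning trace z, i.e. an objective of the form E_{z ~ pi_theta(.|x)}[pi_theta(y* | x, z)]. The sketch below is a minimal illustration in PyTorch against a Hugging Face-style causal-LM interface; the function name and plumbing are assumptions for exposition, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of a verifier-free reward, assuming a Hugging Face-style
# causal LM and tokenizer. `verifree_reward` and its signature are
# illustrative, not taken from the released repository.
import torch
import torch.nn.functional as F

def verifree_reward(model, tokenizer, question, reasoning, reference_answer):
    """Probability the model assigns to the reference answer, conditioned on
    the question plus a sampled reasoning trace (teacher-forced)."""
    prefix_ids = tokenizer(question + reasoning, return_tensors="pt").input_ids
    answer_ids = tokenizer(reference_answer, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)

    logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # Logits at position t predict token t+1, so this slice scores exactly
    # the reference-answer tokens.
    ans_logits = logits[0, prefix_ids.size(1) - 1 : -1]
    log_probs = F.log_softmax(ans_logits, dim=-1)
    token_lp = log_probs.gather(1, answer_ids[0].unsqueeze(1)).squeeze(1)

    # In VeriFree-style training this quantity is maximized directly
    # (gradients flow through it), alongside a policy-gradient term for
    # the sampled reasoning trace.
    return token_lp.sum().exp()  # = pi_theta(y* | x, z)
```

Because the policy itself acts as the implicit verifier, no separate verifier LLM has to be queried or held in memory during training, which is where the reduced compute requirements noted in the abstract come from.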
