General-Reasoner: Advancing LLM Reasoning Across All Domains
May 20, 2025
Authors: Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, Wenhu Chen
cs.AI
Abstract
Reinforcement learning (RL) has recently demonstrated strong potential in
enhancing the reasoning capabilities of large language models (LLMs). In
particular, the "Zero" reinforcement learning introduced by Deepseek-R1-Zero
enables direct RL training of base LLMs without relying on an intermediate
supervised fine-tuning stage. Despite these advancements, current work on LLM
reasoning focuses mainly on the mathematical and coding domains, largely due
to data abundance and the ease of answer verification. This limits the
applicability and generalization of such models to broader domains, where
questions often have diverse answer representations and data is scarcer. In
this paper, we propose General-Reasoner, a novel training paradigm designed to
enhance LLM reasoning capabilities across diverse domains. Our key
contributions include: (1) constructing a large-scale, high-quality dataset of
questions with verifiable answers, curated by web crawling and covering a wide
range of disciplines; and (2) developing a generative model-based answer
verifier that replaces traditional rule-based verification with
chain-of-thought reasoning and context awareness. We train a series of models
and evaluate them on a wide range of datasets spanning domains such as
physics, chemistry, finance, and electronics. Our comprehensive evaluation
across 12 benchmarks (e.g., MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH, and
MATH AMC) demonstrates that General-Reasoner outperforms existing baseline
methods, achieving robust and generalizable reasoning performance while
maintaining superior effectiveness in mathematical reasoning tasks.
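To make the verifier contribution concrete, the sketch below contrasts rule-based verification (normalized string match) with a generative, chain-of-thought verifier. It is a minimal illustration assuming an OpenAI-compatible chat API; the prompt wording, function names, and model name are placeholders of ours, not the paper's actual verifier or training setup.

```python
"""Sketch: rule-based vs. generative answer verification.

Assumptions (not from the paper): an OpenAI-compatible endpoint with
credentials configured, and the placeholder model name "gpt-4o-mini".
"""

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VERIFIER_PROMPT = """\
You are grading a model's answer against a reference answer.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}

Think step by step about whether the model answer is semantically
equivalent to the reference (e.g., "1/2" vs "0.5", unit changes,
paraphrased free-form answers), then end with exactly one line:
VERDICT: CORRECT or VERDICT: INCORRECT."""


def rule_based_verify(reference: str, candidate: str) -> bool:
    # Traditional baseline: normalized exact match. It rejects answers
    # that are equivalent but expressed differently, which is what
    # motivates a generative verifier for non-math domains.
    return reference.strip().lower() == candidate.strip().lower()


def generative_verify(question: str, reference: str, candidate: str,
                      model: str = "gpt-4o-mini") -> bool:
    # Ask the verifier model to reason (chain of thought) before judging,
    # using the question itself as context for the comparison.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": VERIFIER_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0.0,
    )
    text = response.choices[0].message.content
    return text is not None and "VERDICT: CORRECT" in text
```

In an RL pipeline of the kind the abstract describes, such a verdict would serve as the reward signal for each sampled rollout; the rule-based check above would misscore many valid answers in domains with diverse answer representations.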