General-Reasoner: Advancing LLM Reasoning Across All Domains
May 20, 2025
Authors: Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, Wenhu Chen
cs.AI
Abstract
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). In particular, the "Zero" reinforcement learning introduced by Deepseek-R1-Zero enables direct RL training of base LLMs without relying on an intermediate supervised fine-tuning stage. Despite these advancements, current work on LLM reasoning focuses mainly on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations and data is scarcer. In this paper, we propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Our key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers, curated by web crawling and covering a wide range of disciplines; and (2) developing a generative model-based answer verifier that replaces traditional rule-based verification with chain-of-thought, context-aware judgment. We train a series of models and evaluate them on datasets spanning diverse domains such as physics, chemistry, finance, and electronics. Our comprehensive evaluation on 12 benchmarks (e.g., MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH, and MATH AMC) demonstrates that General-Reasoner outperforms existing baseline methods, achieving robust and generalizable reasoning performance while maintaining superior effectiveness on mathematical reasoning tasks.
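
The generative verifier is the contribution most amenable to a concrete illustration. Below is a minimal sketch of a chain-of-thought, model-based answer verifier: a small instruction-tuned LLM reasons about whether a candidate answer is equivalent to the reference before emitting a yes/no verdict, rather than applying rule-based string matching. The prompt template, the verify() helper, and the Qwen2.5 backbone named here are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of a generative answer verifier, assuming a Hugging Face
# transformers stack. Model choice and prompt format are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder verifier backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Student answer: {candidate}\n"
    "Reason step by step about whether the student answer is equivalent "
    "to the reference answer, then end with 'Verdict: yes' or 'Verdict: no'.\n"
)

def verify(question: str, reference: str, candidate: str) -> bool:
    """Return True if the verifier judges the candidate equivalent to the reference."""
    inputs = tokenizer(
        PROMPT.format(question=question, reference=reference, candidate=candidate),
        return_tensors="pt",
    )
    output = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens (the chain of thought plus verdict).
    text = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Only the final verdict gates the RL reward; the reasoning is discarded.
    return "verdict: yes" in text.lower()
```

Unlike exact string matching, such a verifier can accept semantically equivalent forms like "1/2", "0.5", and "one half", which is what makes verifiable-reward RL feasible outside math and code.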