Scaling Reasoning can Improve Factuality in Large Language Models
May 16, 2025
Authors: Mike Zhang, Johannes Bjerva, Russa Biswas
cs.AI
Abstract
Recent studies on large language model (LLM) reasoning capabilities have
demonstrated promising improvements in model performance by leveraging a
lengthy thinking process and additional computational resources during
inference, primarily in tasks involving mathematical reasoning (Muennighoff et
al., 2025). However, it remains uncertain whether longer reasoning chains
inherently enhance factual accuracy, particularly beyond mathematical contexts. In this
work, we thoroughly examine LLM reasoning within complex open-domain
question-answering (QA) scenarios. We initially distill reasoning traces from
advanced, large-scale reasoning models (QwQ-32B and DeepSeek-R1-671B), then
fine-tune a variety of models ranging from smaller, instruction-tuned variants
to larger architectures based on Qwen2.5. To enrich reasoning traces, we
introduce factual information from knowledge graphs in the form of paths into
our reasoning traces. Our experimental setup includes four baseline approaches
and six different instruction-tuned models evaluated across a benchmark of six
datasets, encompassing over 22.6K questions. Overall, we carry out 168
experimental runs and analyze approximately 1.7 million reasoning traces. Our
findings indicate that, within a single run, smaller reasoning models achieve
noticeable improvements in factual accuracy compared to their original
instruction-tuned counterparts. Moreover, our analysis demonstrates that as
test-time compute and token budgets increase, factual accuracy consistently
improves by 2-8%, further confirming the effectiveness of test-time scaling
for enhancing performance and, consequently, reasoning accuracy in open-domain
QA tasks. We release all experimental artifacts for further research.
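
As a concrete illustration of the trace-enrichment step, the sketch below shows how knowledge-graph paths might be rendered as text and spliced into a distilled reasoning trace before fine-tuning. This is a minimal sketch, not the authors' released code: `format_kg_paths` and `enrich_trace` are hypothetical helpers, and entity linking and path retrieval are assumed to happen upstream.

```python
from typing import List

def format_kg_paths(paths: List[List[str]]) -> str:
    """Render KG paths, e.g. ["Oslo", "capital_of", "Norway"],
    as plain-text triples the model can condition on."""
    lines = [" -> ".join(path) for path in paths]
    return "Relevant knowledge graph paths:\n" + "\n".join(lines)

def enrich_trace(question: str, reasoning_trace: str,
                 paths: List[List[str]]) -> str:
    """Prepend KG evidence to a distilled reasoning trace so the
    fine-tuned model learns to ground its thinking in these facts."""
    return (f"Question: {question}\n"
            f"{format_kg_paths(paths)}\n"
            f"<think>\n{reasoning_trace}\n</think>")

# Toy example with a single one-hop path:
print(enrich_trace(
    "What is the capital of Norway?",
    "The path states Oslo is the capital of Norway, so the answer is Oslo.",
    [["Oslo", "capital_of", "Norway"]],
))
```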
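For the test-time scaling result, a hedged sketch of one common recipe is given below: sampling multiple reasoning traces under growing token budgets and majority-voting the extracted answers, in the spirit of Muennighoff et al. (2025). The `generate` and `extract_answer` callables are assumptions standing in for a model client and the paper's answer parser.

```python
from collections import Counter

def scale_test_time(question, generate, extract_answer,
                    budgets=(1024, 2048, 4096), samples_per_budget=4):
    """Sample reasoning traces under increasing token budgets and
    return the majority-vote answer (self-consistency)."""
    answers = []
    for budget in budgets:  # larger budgets allow longer thinking
        for _ in range(samples_per_budget):
            trace = generate(question, max_new_tokens=budget)
            answers.append(extract_answer(trace))
    return Counter(answers).most_common(1)[0][0]
```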