
Scaling Reasoning can Improve Factuality in Large Language Models

May 16, 2025
Authors: Mike Zhang, Johannes Bjerva, Russa Biswas
cs.AI

Abstract

Recent studies on large language model (LLM) reasoning capabilities have demonstrated promising improvements in model performance by leveraging a lengthy thinking process and additional computational resources during inference, primarily in tasks involving mathematical reasoning (Muennighoff et al., 2025). However, it remains uncertain whether longer reasoning chains inherently enhance factual accuracy, particularly beyond mathematical contexts. In this work, we thoroughly examine LLM reasoning within complex open-domain question-answering (QA) scenarios. We first distill reasoning traces from advanced, large-scale reasoning models (QwQ-32B and DeepSeek-R1-671B), then fine-tune a variety of models ranging from smaller, instruction-tuned variants to larger architectures based on Qwen2.5. To enrich the reasoning traces, we introduce factual information from knowledge graphs, in the form of paths, into the traces. Our experimental setup includes four baseline approaches and six different instruction-tuned models evaluated across a benchmark of six datasets, encompassing over 22.6K questions. Overall, we carry out 168 experimental runs and analyze approximately 1.7 million reasoning traces. Our findings indicate that, within a single run, smaller reasoning models achieve noticeable improvements in factual accuracy compared to their original instruction-tuned counterparts. Moreover, our analysis demonstrates that as test-time compute and token budgets increase, factual accuracy consistently improves by 2-8%, further confirming the effectiveness of test-time scaling for enhancing performance and consequently improving reasoning accuracy in open-domain QA tasks. We release all the experimental artifacts for further research.
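As a rough illustration of the knowledge-graph enrichment step described in the abstract, the sketch below shows one way KG paths could be linearized and prepended to a distilled reasoning trace before fine-tuning. The `Triple` type, the `<facts>`/`<think>` tag format, and the `linearize_path` scheme are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch (not the paper's released code) of enriching a distilled
# reasoning trace with linearized knowledge-graph paths.
from dataclasses import dataclass


@dataclass
class Triple:
    """A single knowledge-graph edge: (head, relation, tail)."""
    head: str
    relation: str
    tail: str


def linearize_path(path: list[Triple]) -> str:
    """Render a KG path as a readable chain, e.g. 'A -[r1]-> B -[r2]-> C'."""
    if not path:
        return ""
    parts = [path[0].head]
    for t in path:
        parts.append(f"-[{t.relation}]-> {t.tail}")
    return " ".join(parts)


def enrich_trace(question: str, trace: str, paths: list[list[Triple]]) -> str:
    """Prepend linearized KG paths to a reasoning trace so the fine-tuned
    model sees grounded facts alongside the thinking process.
    The tag names here are assumptions for illustration."""
    facts = "\n".join(linearize_path(p) for p in paths)
    return (
        f"Question: {question}\n"
        f"<facts>\n{facts}\n</facts>\n"
        f"<think>\n{trace}\n</think>"
    )


if __name__ == "__main__":
    # Hypothetical open-domain QA example.
    path = [
        Triple("Aalborg University", "located_in", "Denmark"),
        Triple("Denmark", "capital", "Copenhagen"),
    ]
    print(enrich_trace(
        "In which country is Aalborg University located?",
        "The university is in Aalborg, which is a Danish city...",
        [path],
    ))
```

Training examples built this way pair each question with both grounded facts and a long-form thinking segment; under the paper's setup, test-time scaling would then vary the token budget allotted to the `<think>` span at inference.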
