Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
September 8, 2025
Authors: James Xu Zhao, Bryan Hooi, See-Kiong Ng
cs.AI
Abstract
Test-time scaling increases inference-time computation by allowing models to
generate long reasoning chains, and has shown strong performance across many
domains. However, in this work, we show that this approach is not yet effective
for knowledge-intensive tasks, where high factual accuracy and low
hallucination rates are essential. We conduct a comprehensive evaluation of
test-time scaling using 12 reasoning models on two knowledge-intensive
benchmarks. Our results reveal that increasing test-time computation does not
consistently improve accuracy and, in many cases, it even leads to more
hallucinations. We then analyze how extended reasoning affects hallucination
behavior. We find that reduced hallucinations often result from the model
choosing to abstain after thinking more, rather than from improved factual
recall. Conversely, for some models, longer reasoning encourages attempts on
previously unanswered questions, many of which result in hallucinations. Case
studies show that extended reasoning can induce confirmation bias, leading to
overconfident hallucinations. Despite these limitations, we observe that
compared to non-thinking, enabling thinking remains beneficial. Code and data
are available at https://github.com/XuZhao0/tts-knowledge.
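
Below is a minimal sketch, not the authors' evaluation code, of how the outcome metrics described in the abstract (accuracy, hallucination rate, and abstention rate under different test-time compute budgets) could be tallied. The label names, the function score_responses, and the example data are illustrative assumptions; the actual implementation is in the linked repository.

```python
# Sketch of scoring model responses on a knowledge-intensive benchmark, assuming
# each response has already been labeled as "correct", "hallucination" (a wrong
# but attempted answer), or "abstain" (the model declines to answer).
from collections import Counter


def score_responses(labels: list[str]) -> dict[str, float]:
    """Return accuracy, hallucination rate, and abstention rate over a label list."""
    counts = Counter(labels)
    n = len(labels)
    return {
        "accuracy": counts["correct"] / n,
        "hallucination_rate": counts["hallucination"] / n,
        "abstention_rate": counts["abstain"] / n,
    }


# Hypothetical comparison of a low vs. high thinking-budget setting for one model.
low_budget = ["correct", "hallucination", "correct", "hallucination", "abstain"]
high_budget = ["correct", "abstain", "correct", "abstain", "hallucination"]
print(score_responses(low_budget))
print(score_responses(high_budget))
```

Comparing the two settings this way makes the abstract's point concrete: a drop in hallucination rate at a higher budget may come from a rise in abstentions rather than a rise in accuracy, so all three rates need to be reported together.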