Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
September 8, 2025
Authors: James Xu Zhao, Bryan Hooi, See-Kiong Ng
cs.AI
Abstract
Test-time scaling increases inference-time computation by allowing models to
generate long reasoning chains, and has shown strong performance across many
domains. However, in this work, we show that this approach is not yet effective
for knowledge-intensive tasks, where high factual accuracy and low
hallucination rates are essential. We conduct a comprehensive evaluation of
test-time scaling using 12 reasoning models on two knowledge-intensive
benchmarks. Our results reveal that increasing test-time computation does not
consistently improve accuracy and, in many cases, it even leads to more
hallucinations. We then analyze how extended reasoning affects hallucination
behavior. We find that reduced hallucinations often result from the model
choosing to abstain after thinking more, rather than from improved factual
recall. Conversely, for some models, longer reasoning encourages attempts on
previously unanswered questions, many of which result in hallucinations. Case
studies show that extended reasoning can induce confirmation bias, leading to
overconfident hallucinations. Despite these limitations, we observe that
compared to non-thinking, enabling thinking remains beneficial. Code and data
are available at https://github.com/XuZhao0/tts-knowledge.
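
Below is a minimal sketch, not the authors' evaluation code, of how the outcome metrics described in the abstract (accuracy, hallucination rate, and abstention rate under different test-time compute budgets) could be tallied. The label names, the function score_responses, and the example data are illustrative assumptions; the actual implementation is in the linked repository.

```python
# Sketch of scoring model responses on a knowledge-intensive benchmark, assuming
# each response has already been labeled as "correct", "hallucination" (a wrong
# but attempted answer), or "abstain" (the model declines to answer).
from collections import Counter


def score_responses(labels: list[str]) -> dict[str, float]:
    """Return accuracy, hallucination rate, and abstention rate over a label list."""
    counts = Counter(labels)
    n = len(labels)
    return {
        "accuracy": counts["correct"] / n,
        "hallucination_rate": counts["hallucination"] / n,
        "abstention_rate": counts["abstain"] / n,
    }


# Hypothetical comparison of a low vs. high thinking-budget setting for one model.
low_budget = ["correct", "hallucination", "correct", "hallucination", "abstain"]
high_budget = ["correct", "abstain", "correct", "abstain", "hallucination"]
print(score_responses(low_budget))
print(score_responses(high_budget))
```

Comparing the two settings this way makes the abstract's point concrete: a drop in hallucination rate at a higher budget may come from a rise in abstentions rather than a rise in accuracy, so all three rates need to be reported together.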