Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

September 8, 2025
Authors: James Xu Zhao, Bryan Hooi, See-Kiong Ng
cs.AI

Abstract

Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, it even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that compared to non-thinking, enabling thinking remains beneficial. Code and data are available at https://github.com/XuZhao0/tts-knowledge
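The abstract distinguishes three answer outcomes per question: correct, hallucinated (confidently wrong), and abstained. A minimal sketch of how these rates could be tallied across different thinking budgets is shown below; the data format and function names are illustrative assumptions, not taken from the paper's released code.

```python
# Hypothetical sketch: aggregate accuracy, hallucination rate, and abstention
# rate per thinking budget, mirroring the metrics discussed in the abstract.
from collections import Counter

def summarize(results):
    """results: list of (budget_tokens, outcome) pairs, where outcome is
    'correct', 'hallucinated', or 'abstained'."""
    by_budget = {}
    for budget, outcome in results:
        by_budget.setdefault(budget, Counter())[outcome] += 1
    summary = {}
    for budget, counts in sorted(by_budget.items()):
        total = sum(counts.values())
        summary[budget] = {
            "accuracy": counts["correct"] / total,
            "hallucination_rate": counts["hallucinated"] / total,
            "abstention_rate": counts["abstained"] / total,
        }
    return summary

if __name__ == "__main__":
    toy = [(1024, "correct"), (1024, "hallucinated"),
           (8192, "abstained"), (8192, "hallucinated")]
    print(summarize(toy))
```

Tracking abstention separately from accuracy is what lets one tell whether fewer hallucinations at a larger budget come from better recall or simply from the model declining to answer.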