
Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

August 6, 2025
Authors: Anushka Yadav, Isha Nalawade, Srujana Pillarichety, Yashwanth Babu, Reshmi Ghosh, Samyadeep Basu, Wenlong Zhao, Ali Nasaeh, Sriram Balasubramanian, Soundararajan Srinivasan
cs.AI

Abstract

The emergence of reasoning models and their integration into practical AI chatbots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex, multi-step thought process. Yet a complete understanding of why these models hallucinate more than general-purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.
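The three error dimensions described above (hops, coverage, overthinking) could be represented as an annotation schema. The sketch below is purely illustrative: the class and field names are hypothetical and not taken from the paper's annotation protocol.

```python
from dataclasses import dataclass, field
from enum import Enum


class ErrorDimension(Enum):
    """Hypothetical labels for the paper's three failure dimensions."""
    HOPS = "hops"                  # wrong diversity/uniqueness of source documents
    COVERAGE = "coverage"          # relevant information not fully captured
    OVERTHINKING = "overthinking"  # redundant or inefficient reasoning steps


@dataclass
class ReasoningTraceLabel:
    """One human annotation of a model's multi-hop reasoning trace."""
    question_id: str
    errors: set = field(default_factory=set)

    def is_faithful(self) -> bool:
        # A trace with no flagged dimensions is counted as error-free.
        return not self.errors


# Example: a trace flagged for incomplete coverage and overthinking.
label = ReasoningTraceLabel("q1", {ErrorDimension.COVERAGE, ErrorDimension.OVERTHINKING})
print(label.is_faithful())                          # False
print(sorted(d.value for d in label.errors))        # ['coverage', 'overthinking']
```

Keeping the dimensions as a set (rather than a single category) reflects that one trace can exhibit several failure modes at once, which is what accuracy-only evaluation hides.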