Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
August 6, 2025
Authors: Anushka Yadav, Isha Nalawade, Srujana Pillarichety, Yashwanth Babu, Reshmi Ghosh, Samyadeep Basu, Wenlong Zhao, Ali Nasaeh, Sriram Balasubramanian, Soundararajan Srinivasan
cs.AI
Abstract
The emergence of reasoning models and their integration into practical AI
chatbots has led to breakthroughs in solving advanced math, deep search, and
extractive question answering problems that require a complex, multi-step
thought process. Yet a complete understanding of why these models hallucinate
more than general-purpose language models is still missing. In this
investigative study, we systematically explore the reasoning failures of
contemporary language models on multi-hop question answering tasks. We
introduce a novel, nuanced error categorization framework that examines
failures across three critical dimensions: the diversity and uniqueness of the
source documents involved ("hops"), completeness in capturing relevant
information ("coverage"), and cognitive inefficiency ("overthinking"). Through
rigorous human annotation, supported by complementary automated metrics, our
exploration uncovers intricate error patterns that are often hidden by
accuracy-centric evaluations. This investigative approach provides deeper
insights into the cognitive limitations of current models and offers
actionable guidance for enhancing reasoning fidelity, transparency, and
robustness in future language modeling efforts.
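
As a rough illustration of the three-dimensional taxonomy described above, the sketch below encodes a single annotated example as a record with hop, coverage, and overthinking fields. This is a hypothetical reading of the framework, not the authors' code; all names, thresholds, and the step-budget heuristic are assumptions.

```python
# Hypothetical sketch of a per-question annotation record for the
# hops / coverage / overthinking error taxonomy. Not the paper's code.
from dataclasses import dataclass
from enum import Enum


class HopError(Enum):
    """Errors in the diversity/uniqueness of source documents used."""
    ALL_HOPS_VISITED = "every required source document was consulted"
    MISSED_HOP = "a required source document was never consulted"
    REDUNDANT_HOP = "a document was revisited without new information"


@dataclass
class ReasoningAnnotation:
    question_id: str
    hops: HopError            # which hop-level failure (if any) occurred
    coverage: float           # fraction of gold supporting facts captured, in [0, 1]
    reasoning_steps: int      # steps the model actually produced
    minimal_steps: int        # shortest chain a human annotator judged sufficient

    @property
    def overthinking(self) -> int:
        """Extra steps beyond the minimal sufficient chain (0 = efficient)."""
        return max(0, self.reasoning_steps - self.minimal_steps)

    def is_fully_correct(self) -> bool:
        """A trace counts as clean only if all three dimensions are clean."""
        return (
            self.hops is HopError.ALL_HOPS_VISITED
            and self.coverage == 1.0
            and self.overthinking == 0
        )


# Example: a trace that found every document but padded the chain.
ann = ReasoningAnnotation("hotpot-0042", HopError.ALL_HOPS_VISITED,
                          coverage=1.0, reasoning_steps=9, minimal_steps=4)
print(ann.overthinking, ann.is_fully_correct())  # -> 5 False
```

Separating the three dimensions this way is what lets error patterns surface that a single accuracy score would hide, e.g. answers that are correct despite missed hops, or correct answers reached only after heavy overthinking.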