Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

Abstract

The emergence of reasoning models and their integration into practical AI chatbots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require complex, multi-step reasoning. Yet a complete understanding of why these models hallucinate more than general-purpose language models is still missing. In this investigative study, we systematically explore the reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures along three critical dimensions: the diversity and uniqueness of the source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns that accuracy-centric evaluations often hide. This investigative approach provides deeper insight into the cognitive limitations of current models and offers actionable guidance for improving reasoning fidelity, transparency, and robustness in future language modeling efforts.
