Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

Abstract

The emergence of reasoning models and their integration into practical AI chatbots have led to breakthroughs in advanced math, deep search, and extractive question answering, all of which require complex, multi-step thought processes. Yet a complete understanding of why these models hallucinate more often than general-purpose language models is still missing. In this investigative study, we systematically explore the reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, fine-grained error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of the source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns that accuracy-centric evaluations often hide. This investigative approach provides deeper insight into the cognitive limitations of current models and offers actionable guidance for improving reasoning fidelity, transparency, and robustness in future language modeling efforts.

