Over-Searching in Search-Augmented Large Language Models

January 9, 2026
Authors: Roy Xie, Deepak Gopinath, David Qiu, Dong Lin, Haitian Sun, Saloni Potdar, Bhuwan Dhingra
cs.AI

Abstract

Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search, unnecessarily invoking the search tool even when it does not improve response quality, which leads to computational inefficiency and to hallucinations from incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our findings show: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of retrieved evidence is crucial, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Lastly, we investigate mitigation approaches at both the query and retrieval levels and release the OverSearchQA dataset to foster continued research into efficient search-augmented LLMs.
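The abstract names Tokens Per Correctness (TPC) but does not give its formula. Below is a minimal sketch of one plausible reading, total tokens consumed divided by the number of correct (or correctly abstained) responses; the function name, dataclass, and per-query fields are illustrative assumptions, not the paper's definition or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryResult:
    tokens_used: int  # tokens consumed for this query (generation plus retrieved context), assumed accounting
    correct: bool     # whether the answer was judged correct (or a correct abstention)


def tokens_per_correctness(results: List[QueryResult]) -> float:
    """Plausible reading of TPC: total tokens spent per correct response.

    Lower is better: a model that searches only when search actually helps
    spends fewer tokens for each correct answer.
    """
    total_tokens = sum(r.tokens_used for r in results)
    total_correct = sum(r.correct for r in results)
    if total_correct == 0:
        return float("inf")  # no correct answers: cost per correctness is unbounded
    return total_tokens / total_correct


# Toy usage: an over-searching model burns more tokens for the same accuracy.
efficient = [QueryResult(120, True), QueryResult(90, True), QueryResult(80, False)]
over_searching = [QueryResult(900, True), QueryResult(750, True), QueryResult(600, False)]
print(tokens_per_correctness(efficient))       # 145.0
print(tokens_per_correctness(over_searching))  # 1125.0
```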