

Over-Searching in Search-Augmented Large Language Models

January 9, 2026
作者: Roy Xie, Deepak Gopinath, David Qiu, Dong Lin, Haitian Sun, Saloni Potdar, Bhuwan Dhingra
cs.AI

Abstract
Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search -- unnecessarily invoking the search tool even when it does not improve response quality, which leads to computational inefficiency and to hallucinations from incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our findings show: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of retrieved evidence is crucial, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Lastly, we investigate mitigation approaches at both the query and retrieval levels and release the OverSearchQA dataset to foster continued research into efficient search-augmented LLMs.
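The abstract describes Tokens Per Correctness (TPC) only as a metric capturing the performance-cost trade-off; the paper's exact definition is not given here. A minimal sketch of one plausible reading, assuming TPC is simply total tokens consumed divided by the number of correct answers (the function name and formula are illustrative assumptions, not the authors' specification):

```python
def tokens_per_correctness(token_counts, correct_flags):
    """Hypothetical TPC: total tokens spent per correct answer.

    token_counts  -- tokens used for each query (prompt + search + generation)
    correct_flags -- 1/True if the answer was judged correct, else 0/False
    """
    total_tokens = sum(token_counts)
    num_correct = sum(correct_flags)
    if num_correct == 0:
        # No correct answers: cost per correctness diverges.
        return float("inf")
    return total_tokens / num_correct


# Example: two systems with identical correctness but different search budgets.
lean = tokens_per_correctness([200, 250, 300, 220], [1, 1, 0, 1])
heavy = tokens_per_correctness([900, 950, 1000, 920], [1, 1, 0, 1])
assert lean < heavy  # over-searching inflates TPC at equal correctness
```

Under this reading, lower TPC is better: an over-searching system that invokes retrieval unnecessarily pays more tokens for the same number of correct answers.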