Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Abstract

Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search it becomes a bottleneck: exact lexical constraints, conjunctions of sparse clues, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), in which an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus. DCI thus opens a broader interface-design space for agentic search.
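As a rough illustration of the kind of operations the abstract has in mind (a conjunction of exact lexical clues followed by a local context check), here is a minimal Python sketch. It is not the authors' implementation: the function names `files_matching_all` and `context_lines`, the `.txt` corpus layout, and the case-insensitive substring matching are all assumptions made for this example. It only mimics what an agent might do with grep and file reads.

```python
from pathlib import Path


def files_matching_all(corpus_dir, terms):
    """Return .txt files under corpus_dir that contain every term.

    Hypothetical helper: matching is a case-insensitive exact substring
    test, roughly what chaining several `grep -il` calls would give.
    """
    hits = []
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        if all(term.lower() in text for term in terms):
            hits.append(path)
    return hits


def context_lines(path, term, window=1):
    """Return (line_number, snippet) pairs for each line containing term.

    Hypothetical helper: each snippet includes `window` lines before and
    after the match, similar in spirit to `grep -n -C`.
    """
    lines = Path(path).read_text(encoding="utf-8", errors="ignore").splitlines()
    out = []
    for i, line in enumerate(lines):
        if term.lower() in line.lower():
            lo = max(0, i - window)
            hi = min(len(lines), i + window + 1)
            out.append((i + 1, "\n".join(lines[lo:hi])))
    return out
```

An agent could call `files_matching_all` with two or three weak clues to narrow the candidate files, then use `context_lines` to inspect the surrounding text before revising its hypothesis. Nothing here requires an embedding model or a prebuilt index, which is the property the abstract emphasizes.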
