Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Abstract

Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but it becomes a bottleneck for agentic search: exact lexical constraints, conjunctions of sparse clues, local context checks, and multi-step hypothesis refinement are difficult to express through a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), in which an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that, as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus. DCI thereby opens a broader interface-design space for agentic search.
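
To make the kind of interaction described above concrete, the following is a minimal Python sketch of two DCI-style operations: an exact lexical match with surrounding context (roughly what `grep -rn -C 1` provides) and a conjunction of sparse clues over raw files. It is an illustration under stated assumptions, not the implementation evaluated in this work; the function names (`grep_corpus`, `files_matching_all`, `build_toy_corpus`) and the three toy documents are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def grep_corpus(root, pattern, context=1):
    """Return (file name, line number, snippet) for each line matching `pattern`.

    Roughly mimics `grep -rn -C <context>`: an exact regex match over raw
    files, with neighbouring lines kept for a local context check.
    """
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for i, line in enumerate(lines):
            if regex.search(line):
                lo, hi = max(0, i - context), min(len(lines), i + context + 1)
                hits.append((path.name, i + 1, "\n".join(lines[lo:hi])))
    return hits


def files_matching_all(root, clues):
    """Return names of files containing every clue (a sparse-clue conjunction)."""
    matched = []
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8").lower()
        if all(clue.lower() in text for clue in clues):
            matched.append(path.name)
    return matched


def build_toy_corpus(root):
    """Write three tiny hypothetical documents, used only for illustration."""
    docs = {
        "doc_a.txt": "The bridge opened in 1932.\nIt was designed by Jane Roe.\n",
        "doc_b.txt": "Jane Roe later moved to Lyon.\nShe taught engineering there.\n",
        "doc_c.txt": "A bridge in Lyon was repaired in 1990.\n",
    }
    for name, text in docs.items():
        (Path(root) / name).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_toy_corpus(tmp)
        # Step 1: an exact lexical constraint surfaces an intermediate entity.
        for name, lineno, snippet in grep_corpus(tmp, r"\b1932\b"):
            print(f"{name}:{lineno}\n{snippet}")
        # Step 2: combine the discovered entity with a second weak clue.
        print(files_matching_all(tmp, ["Jane Roe", "Lyon"]))
```

In this toy setting, the first call finds the year in one document and its context line reveals a name; the second call uses that name together with another clue to locate a different document. A top-k similarity interface would have to return both documents in a single ranked list, whereas here each step is an explicit, inspectable query against the raw files.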