DR-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Abstract

Agentic search over large corpora relies on retriever-mediated interfaces (e.g., BM25 or ColBERT) for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views, which limits an agent's ability to reorganize material and verify constraints across documents. Direct Corpus Interaction (DCI) addresses this limitation by exposing shell-executable corpus operations for flexible search, filtering, comparison, and verification. However, full-corpus terminal commands become slow and unstable as the corpus grows, degrading both performance and efficiency. We introduce DR-DCI, a retriever-steered DCI framework that treats retrieval as an agent-callable action for expanding a local workspace. Rather than operating directly over the full corpus, the agent dynamically pulls relevant documents into an evolving workspace and conducts DCI operations within it. This design combines retriever-level recall with DCI-style precision: retrieval keeps exploration scalable, while DCI preserves the local operations needed for effective evidence resolution. Experiments show that DR-DCI is both effective and efficient across scales. On BrowseComp-Plus, DR-DCI reaches 71.2% accuracy, improving over raw DCI and ablated variants by up to 8.3 percentage points while reducing tool usage, wall time, and estimated cost. With workspace-preserving context reset, accuracy further improves to 73.3%. In corpus-scaling experiments, DR-DCI remains effective from 100K to 10M documents, whereas raw DCI becomes unstable and BM25 performs substantially worse. DR-DCI also scales to a 20M-scale file-per-document Wiki-18 QA setting, achieving an average score of 63.0 across six benchmarks and outperforming retrieval-based and trained search-agent baselines. Ablation analysis further shows that ranked previews and inter-document DCI are key to performance.
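
The workspace-expansion idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: `ToyRetriever` is a hypothetical term-overlap ranker standing in for BM25 or ColBERT, and `Workspace`, `expand`, and `grep` are names assumed here for illustration. The sketch only shows the division of labor: retrieval pulls documents into a local directory, and a grep-like operation then scans that directory rather than the full corpus.

```python
import re
import tempfile
from pathlib import Path

# Tiny in-memory corpus (hypothetical data, for illustration only).
CORPUS = {
    "d1": "Ada Lovelace wrote notes on the Analytical Engine in 1843.",
    "d2": "Charles Babbage designed the Analytical Engine.",
    "d3": "The Eiffel Tower opened in 1889.",
}


def tokenize(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ToyRetriever:
    """Assumed stand-in for a real retriever: ranks by query-term overlap."""

    def __init__(self, corpus):
        self.corpus = corpus
        self.tokens = {doc_id: tokenize(text) for doc_id, text in corpus.items()}

    def search(self, query, k=2):
        q = tokenize(query)
        scored = [(len(q & toks), doc_id) for doc_id, toks in self.tokens.items()]
        # Highest overlap first; ties broken by document id for determinism.
        ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
        return [doc_id for _, doc_id in ranked[:k]]


class Workspace:
    """Local directory that grows only when the agent calls `expand`."""

    def __init__(self, root, retriever):
        self.root = Path(root)
        self.retriever = retriever

    def expand(self, query, k=2):
        """Retrieval as an action: write top-k hits to disk, return new ids."""
        added = []
        for doc_id in self.retriever.search(query, k):
            path = self.root / f"{doc_id}.txt"
            if not path.exists():
                path.write_text(self.retriever.corpus[doc_id], encoding="utf-8")
                added.append(doc_id)
        return added

    def grep(self, pattern):
        """Local operation: regex scan over workspace files only."""
        rx = re.compile(pattern, re.IGNORECASE)
        hits = []
        for path in sorted(self.root.glob("*.txt")):
            for line in path.read_text(encoding="utf-8").splitlines():
                if rx.search(line):
                    hits.append((path.stem, line))
        return hits


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        ws = Workspace(tmp, ToyRetriever(CORPUS))
        added = ws.expand("Analytical Engine")  # pull candidates into the workspace
        hits = ws.grep(r"\b18\d\d\b")           # verify a constraint locally
        return added, hits
```

In this toy run, `demo()` pulls `d1` and `d2` into the workspace, and the year pattern matches only `d1`; `d3` also contains a matching year but is never scanned because it was never retrieved. A real system would presumably replace the in-memory ranker with an index-backed retriever and the `grep` method with actual shell commands.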