DR-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Abstract

Agentic search over large corpora relies on retriever-mediated interfaces (e.g., BM25 or ColBERT) for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views, limiting agents' ability to reorganize material and verify constraints across documents. Direct Corpus Interaction (DCI) addresses this limitation by exposing shell-executable corpus operations for flexible search, filtering, comparison, and verification. However, full-corpus terminal commands become slow and unstable as the corpus grows, degrading performance and efficiency. We introduce DR-DCI, a retriever-steered DCI framework that treats retrieval as an agent-callable action for expanding a local workspace. Rather than operating directly over the full corpus, the agent dynamically pulls relevant documents into an evolving workspace and conducts DCI operations within it. This design combines retriever-level recall with DCI-style precision: retrieval keeps exploration scalable, while DCI preserves the local operations needed for effective evidence resolution. Experiments show that DR-DCI is both effective and efficient across scales. On BrowseComp-Plus, DR-DCI reaches 71.2% accuracy, improving over raw DCI and ablated variants by up to 8.3 points while reducing tool usage, wall time, and estimated cost. With workspace-preserving context reset, accuracy further improves to 73.3%. In corpus-scaling experiments, DR-DCI remains effective from 100K to 10M documents, whereas raw DCI becomes unstable and BM25 performs substantially worse. DR-DCI also scales to a 20M-scale file-per-document Wiki-18 QA setting, achieving an average score of 63.0 across six benchmarks and outperforming retrieval-based and trained search-agent baselines. Ablation analysis further shows that ranked previews and inter-document DCI are key to performance.
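
To make the workspace idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy in-memory corpus, a hypothetical term-overlap `retrieve` function standing in for BM25 or ColBERT, and a hypothetical `Workspace` class whose `grep` method stands in for a shell-executable DCI operation. All names are illustrative.

```python
import re
import tempfile
from pathlib import Path

# Toy corpus (assumption); the paper's settings range from 100K to 20M documents.
CORPUS = {
    "doc1": "Ada Lovelace wrote notes on the Analytical Engine in 1843.",
    "doc2": "Charles Babbage designed the Analytical Engine.",
    "doc3": "The Eiffel Tower opened in Paris in 1889.",
}


def tokenize(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, k):
    """Hypothetical retriever: rank by term overlap (a stand-in for BM25)."""
    q = tokenize(query)
    scored = [(len(q & tokenize(text)), doc_id) for doc_id, text in CORPUS.items()]
    # Keep only documents with some overlap; break ties by document id.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]


class Workspace:
    """A local directory that grows as the agent calls retrieval."""

    def __init__(self, root):
        self.root = Path(root)

    def expand(self, query, k=2):
        """Retrieval as an action: write the top-k hits into the workspace."""
        added = []
        for doc_id in retrieve(query, k):
            path = self.root / f"{doc_id}.txt"
            if not path.exists():  # documents already pulled in are kept as-is
                path.write_text(CORPUS[doc_id], encoding="utf-8")
                added.append(doc_id)
        return added

    def grep(self, pattern):
        """Local DCI-style operation: regex scan over workspace files only."""
        rx = re.compile(pattern, re.IGNORECASE)
        return sorted(
            p.stem
            for p in self.root.glob("*.txt")
            if rx.search(p.read_text(encoding="utf-8"))
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = Workspace(tmp)
        print(ws.expand("Analytical Engine"))   # documents pulled into the workspace
        print(ws.grep(r"18\d\d"))               # scan runs over the workspace only
```

The point of the sketch is the division of labor: `expand` touches the full corpus only through the retriever, while `grep` never leaves the workspace directory, so its cost depends on workspace size rather than corpus size.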