GrepSeek: Training Search Agents That Interact Directly with the Corpus

Abstract

Large language model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information through a retriever that takes a keyword or natural language query and returns a ranked list of documents from an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) approach that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning search behavior directly through reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and an answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to 7.6× while preserving byte-exact equivalence with sequential execution of the same shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level F1 and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting that DCI is a practical, competitive approach for real-world search agents that complements existing retrieval paradigms.
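
The byte-exact equivalence claimed for sharded-parallel execution can be illustrated with a minimal sketch. It is not the GrepSeek engine: it assumes a line-oriented filter (a pure-Python stand-in for `grep`), contiguous line-aligned shards, and concatenation of shard outputs in shard order. The names `grep_lines`, `split_into_shards`, and `sharded_grep` are hypothetical and chosen only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import re


def grep_lines(pattern: bytes, lines: list[bytes]) -> bytes:
    """Sequential stand-in for `grep PATTERN`: keep matching lines, in order."""
    rx = re.compile(pattern)
    return b"".join(line for line in lines if rx.search(line))


def split_into_shards(lines: list[bytes], num_shards: int) -> list[list[bytes]]:
    """Cut the corpus into contiguous shards without splitting any line."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def sharded_grep(pattern: bytes, lines: list[bytes], num_shards: int = 4) -> bytes:
    """Run the filter on each shard in parallel, then join outputs in shard order."""
    shards = split_into_shards(lines, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # Executor.map returns results in input order, not completion order,
        # so the concatenation below preserves the original line order.
        parts = list(pool.map(lambda shard: grep_lines(pattern, shard), shards))
    return b"".join(parts)


# Toy corpus (illustrative data only).
corpus = [
    b"Paris is the capital of France.\n",
    b"The Seine flows through Paris.\n",
    b"Canberra is the capital of Australia.\n",
    b"Sydney is the largest city in Australia.\n",
    b"Ottawa is the capital of Canada.\n",
]

sequential = grep_lines(b"capital", corpus)
parallel = sharded_grep(b"capital", corpus, num_shards=3)
assert sequential == parallel  # byte-for-byte identical output
```

The equivalence holds here because the filter decides each line independently and shard boundaries fall on line boundaries. Commands whose output depends on global state (for example a match count or a `head`-style cutoff) would need an extra merge step. The threads in this sketch demonstrate ordering rather than speedup.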