GrepSeek: Training Search Agents for Direct Corpus Interaction

Abstract

Large language model (LLM) search agents have shown strong promise on knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information through a retriever that takes a keyword or natural language query and returns a ranked list of documents from an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent: a compact model trained to find, filter, and compose evidence from large text corpora. To address the instability of learning search behavior directly through reinforcement learning over large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and an answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to 7.6× while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level F1 and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, and suggests that DCI is a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
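
The byte-exact equivalence claim for sharded-parallel execution can be illustrated with a minimal sketch. This is not the GrepSeek engine, whose design the abstract does not describe; it assumes an in-memory corpus of byte strings and a stateless per-line filter that behaves like `grep -n`. The function names `search_lines` and `sharded_search` are hypothetical. The idea shown is only that contiguous shards, searched in parallel and concatenated in shard order, reproduce the sequential output exactly.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def search_lines(lines, pattern, offset=0):
    """Sequential reference: emit b"<lineno>:<line>" for each matching line,
    mimicking `grep -n`. `offset` is the number of lines before this shard."""
    rx = re.compile(pattern)
    return [
        b"%d:%s" % (offset + i + 1, line)
        for i, line in enumerate(lines)
        if rx.search(line)
    ]


def sharded_search(lines, pattern, num_shards=4):
    """Hypothetical sharded variant: split the corpus into contiguous shards,
    search them in parallel, and concatenate results in shard order."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division
    shards = [(lines[s:s + size], s) for s in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # Executor.map yields results in input order, so global line order
        # is preserved regardless of which shard finishes first.
        parts = pool.map(lambda sh: search_lines(sh[0], pattern, sh[1]), shards)
        return [hit for part in parts for hit in part]


if __name__ == "__main__":
    corpus = [b"alpha beta", b"gamma", b"beta gamma", b"delta", b"beta"]
    sequential = b"\n".join(search_lines(corpus, rb"beta"))
    parallel = b"\n".join(sharded_search(corpus, rb"beta", num_shards=3))
    print(sequential == parallel)
```

The equivalence holds here because the filter looks at one line at a time and keeps no state across lines. Commands whose output depends on the whole input, such as sorting or counting, would need an extra merge step; how the actual engine handles such cases is not stated in the abstract.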