GrepSeek: Training Search Agents for Direct Corpus Interaction

Abstract

Large language model (LLM) search agents have shown strong promise on knowledge-intensive language tasks by interleaving multiple rounds of reasoning and information retrieval. Most existing systems access information through a retriever that takes a keyword or natural-language query and returns a ranked list of documents from an index of precomputed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent: a compact model trained to find, filter, and compose evidence from large text corpora. To address the instability of learning this behavior directly through reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and an answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to 7.6× while preserving byte-exact equivalence with sequential execution of the same shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level F1 and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting that DCI is a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
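
The abstract does not say how the sharded-parallel execution engine is implemented. To illustrate what byte-exact equivalence with sequential execution means for a line-local command such as grep, the following is a minimal Python sketch under stated assumptions. It is not GrepSeek's engine: the function names (`grep_lines`, `split_into_shards`, `sharded_grep`), the contiguous sharding scheme, and the toy corpus are all hypothetical. The sketch demonstrates the equivalence property only, not the reported speedup.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def grep_lines(lines, pattern):
    """Return the lines matching `pattern`, in input order (a stand-in for grep)."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


def split_into_shards(lines, num_shards):
    """Split `lines` into contiguous shards; concatenating them restores the input."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def sharded_grep(lines, pattern, num_shards=4):
    """Search each shard in parallel, then merge the hits in shard order."""
    shards = split_into_shards(lines, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # Executor.map yields results in submission order, not completion order,
        # so the merged output keeps the original line order.
        per_shard = list(pool.map(lambda shard: grep_lines(shard, pattern), shards))
    return [line for hits in per_shard for line in hits]


# Hypothetical toy corpus: one passage per line.
corpus = [
    "Marie Curie won the Nobel Prize in Physics in 1903.\n",
    "The Eiffel Tower was completed in 1889.\n",
    "Albert Einstein received the Nobel Prize in Physics in 1921.\n",
    "Mount Everest is the highest mountain above sea level.\n",
    "The Nobel Prize ceremony is held in Stockholm.\n",
]

sequential = "".join(grep_lines(corpus, r"Nobel Prize")).encode("utf-8")
parallel = "".join(sharded_grep(corpus, r"Nobel Prize", num_shards=3)).encode("utf-8")
assert sequential == parallel  # identical bytes, identical order
```

The merge here is a plain in-order concatenation, which is sufficient only because a grep-style filter decides each line independently of the others. Commands whose output depends on the whole input (for example sorting, counting, or taking the first N lines) would need a command-specific merge step to stay equivalent to sequential execution.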