RL-Index: Reinforcement Learning for Retrieval Index Reasoning

Abstract

Retrieving external knowledge is essential for solving real-world tasks, yet it remains challenging when the relationship between a query and its relevant knowledge involves implicit, complex reasoning that goes beyond surface-level semantic or lexical matching (e.g., mathematical problems that rely on the same theorem, or coding tasks that require deep reasoning). Existing approaches rely primarily on query-side reasoning (e.g., query rewriting), which introduces significant online latency and underutilizes the opportunity to reason over the knowledge corpus itself (i.e., index-side reasoning). In this paper, we propose RL-Index, an agentic indexing framework that formulates retrieval index reasoning as a reinforcement learning problem. Instead of reasoning at query time, RL-Index shifts reasoning to the indexing stage by augmenting documents with LLM-generated rationales that explicitly encode the latent query-knowledge relationship. To optimize the quality of these rationales, we employ Group Relative Policy Optimization (GRPO) and use retrieval similarity as a verifiable reward signal, which lets us optimize indexing decisions directly for retrieval effectiveness. Extensive experiments on the BRIGHT benchmark demonstrate that RL-Index consistently improves both retrieval and downstream question-answering performance while significantly reducing online inference latency. Moreover, the learned rationale augmentation generalizes across diverse retrievers and generators, highlighting its robustness as a plug-and-play indexing strategy.
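The reward design described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a toy bag-of-words similarity in place of a real retriever, a hand-written group of candidate rationales in place of LLM samples, and the standard GRPO normalization in which each reward is standardized against the mean and standard deviation of its sampled group. The names `embed`, `cosine`, `retrieval_reward`, and `group_relative_advantages` are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real retriever encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieval_reward(query, document, rationale):
    """Assumed reward: similarity between the query and the rationale-augmented document."""
    return cosine(embed(query), embed(document + " " + rationale))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical training example: one query, its relevant document, and a
# group of candidate rationales (an LLM policy would sample these in practice).
query = "problem solved with the triangle inequality theorem"
document = "the norm of x plus y is at most the norm of x plus the norm of y"
rationales = [
    "this passage states the triangle inequality theorem",
    "this passage is about vectors",
    "cooking pasta requires boiling water",
]

rewards = [retrieval_reward(query, document, r) for r in rationales]
advantages = group_relative_advantages(rewards)

for rationale, reward, advantage in zip(rationales, rewards, advantages):
    print(f"{reward:.3f}  {advantage:+.3f}  {rationale}")
```

In this sketch, the rationale that names the theorem shared by the query and the document receives the largest advantage, so a policy-gradient update would make such rationales more likely. Because the rationale is appended at indexing time, the query-time path stays a plain similarity lookup with no additional LLM call.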