RL-Index: Reinforcement Learning for Retrieval Index Reasoning

Abstract

Retrieving external knowledge is essential for solving real-world tasks, yet it remains challenging when the relationship between a query and its relevant knowledge involves implicit and complex reasoning beyond surface-level semantic or lexical matching (e.g., mathematical problems that rely on the same theorem, or coding problems that require deep reasoning). Existing approaches primarily rely on query-side reasoning (e.g., query rewriting), which introduces significant online latency and underutilizes the opportunity to reason over the knowledge corpus itself (i.e., index-side reasoning). In this paper, we propose RL-Index, an agentic indexing framework that formulates retrieval index reasoning as a reinforcement learning problem. Instead of reasoning at query time, RL-Index shifts reasoning to the indexing stage by augmenting documents with LLM-generated rationales that explicitly encode the latent query-knowledge relationship. To optimize the quality of these rationales, we employ Group Relative Policy Optimization (GRPO) and use retrieval similarity as a verifiable reward signal, which allows indexing decisions to be optimized directly for retrieval effectiveness. Extensive experiments on the BRIGHT benchmark demonstrate that RL-Index consistently improves both retrieval and downstream question-answering performance while significantly reducing online inference latency. Moreover, the learned rationale augmentation generalizes across diverse retrievers and generators, highlighting its robustness as a plug-and-play indexing strategy.
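
To make the reward idea concrete, the following is a minimal sketch of how retrieval similarity could score a group of candidate rationales for one document, and how GRPO-style group-relative advantages could be derived from those scores. It is not the authors' implementation: the abstract does not specify the retriever, the reward formula, or the training loop. The bag-of-words `embed` function, the helper names (`retrieval_reward`, `group_relative_advantages`), and the example query, document, and rationales are all hypothetical stand-ins chosen for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real retriever encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieval_reward(query, document, rationale):
    """Reward = similarity between the query and the rationale-augmented document.

    Assumption: the rationale is simply appended to the document text before
    indexing; the paper may combine them differently.
    """
    return cosine(embed(query), embed(document + " " + rationale))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical example: one document, one training query, three sampled rationales.
query = "triangle inequality theorem for distance"
document = "for any points x y z we have d(x,z) <= d(x,y) + d(y,z)"
rationales = [
    "states the triangle inequality theorem for distance",
    "this passage is about distance",
    "a recipe about cooking pasta",
]

rewards = [retrieval_reward(query, document, r) for r in rationales]
advantages = group_relative_advantages(rewards)

for rationale, reward, advantage in zip(rationales, rewards, advantages):
    print(f"{reward:.3f}  {advantage:+.3f}  {rationale}")
```

In this sketch, the rationale that names the underlying theorem makes the augmented document most similar to the query, so it receives a positive advantage, while the unrelated rationale receives a negative one. In a full GRPO setup these advantages would weight the policy-gradient update of the rationale-generating LLM; that step is omitted here.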