ConvSearch-R1: Enhancing Query Reformulation for Conversational Search with Reasoning via Reinforcement Learning

Abstract

Conversational search systems must handle context-dependent queries that often contain ambiguity, omission, and coreference. Conversational Query Reformulation (CQR) addresses this challenge by transforming such queries into self-contained forms suitable for off-the-shelf retrievers. However, existing CQR approaches face two key limitations: heavy dependence on costly external supervision, from either human annotations or large language models, and insufficient alignment between the rewriting model and the downstream retriever. We present ConvSearch-R1, the first self-driven framework that completely eliminates the dependence on external rewrite supervision by using reinforcement learning to optimize reformulation directly from retrieval signals. Our two-stage approach begins with Self-Driven Policy Warm-Up, which addresses the cold-start problem through retrieval-guided self-distillation. It then applies Retrieval-Guided Reinforcement Learning with a specially designed rank-incentive reward shaping mechanism that addresses the sparsity of conventional retrieval metrics. Extensive experiments on the TopiOCQA and QReCC datasets show that ConvSearch-R1 significantly outperforms previous state-of-the-art methods, achieving an improvement of over 10% on the challenging TopiOCQA dataset while using a smaller 3B-parameter model and no external supervision.
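The sparsity issue mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the form of the rank-incentive reward, so the linear decay below, the `max_rank` cutoff, and both function names are assumptions made purely for illustration. They are not the reward used by ConvSearch-R1. The point is only that a binary top-k metric gives the same zero signal to a rewrite that ranks the gold passage 11th and one that misses it entirely, whereas a rank-aware reward separates the two.

```python
from typing import Optional


def sparse_hit_reward(rank: Optional[int], k: int = 10) -> float:
    """Binary top-k reward: 1.0 if the gold passage is ranked within k, else 0.0.

    `rank` is 1-based; None means the gold passage was not retrieved at all.
    """
    return 1.0 if rank is not None and rank <= k else 0.0


def rank_shaped_reward(rank: Optional[int], max_rank: int = 100) -> float:
    """Hypothetical dense reward that decays linearly with the gold passage's rank.

    This is an illustrative assumption, not the paper's reward function.
    """
    if rank is None or rank > max_rank:
        return 0.0
    return (max_rank - rank + 1) / max_rank


if __name__ == "__main__":
    # Compare the two rewards for a few gold-passage ranks.
    for rank in [1, 10, 11, 50, None]:
        print(rank, sparse_hit_reward(rank), rank_shaped_reward(rank))
```

Under these assumptions, rewrites that place the gold passage at ranks 11 and 50 both receive zero from the binary reward but different values from the shaped one, giving the policy a gradient toward better rankings even when the top-k threshold is missed.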
