ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability

Abstract

Listwise ranking based on Large Language Models (LLMs) has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning at test time helps improve listwise ranking performance. However, because reasoning-intensive training data is scarce, existing rerankers perform poorly in many complex ranking scenarios, and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated framework for synthesizing reasoning-intensive training data, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure data quality. To give the listwise reranker strong reasoning ability, we further propose a two-stage post-training approach, consisting of a cold-start supervised fine-tuning (SFT) stage for learning reasoning patterns and a reinforcement learning (RL) stage for further enhancing ranking ability. For the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker, ReasonRank, significantly outperforms existing baselines and also achieves much lower latency than the pointwise reranker Rank1. Through further experiments, ReasonRank has achieved state-of-the-art (SOTA) performance, with a score of 40.6, on the BRIGHT leaderboard (https://brightbenchmark.github.io/). Our code is available at https://github.com/8421BCD/ReasonRank.
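For readers unfamiliar with the term, the sketch below illustrates what a "ranking metric-based reward" could look like, using NDCG@k as the metric. This is only the kind of baseline reward the abstract compares against, not the multi-view ranking reward itself, whose definition the abstract does not give. The function names, the passage identifiers, the choice of NDCG, and the cutoff k=10 are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with linear gains and a log2 position discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranking: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """NDCG@k of a predicted ranking (list of passage ids) against graded labels.

    Hypothetical reward function: a higher value means the predicted order
    is closer to the ideal order implied by the labels.
    """
    gains = [relevance.get(pid, 0.0) for pid in ranking[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    # No relevant passage at all: define the reward as 0 to avoid dividing by zero.
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Illustrative labels (assumed): p1 and p3 are relevant, p2 is not.
labels = {"p1": 1.0, "p2": 0.0, "p3": 1.0}
perfect = ndcg_at_k(["p1", "p3", "p2"], labels)  # relevant passages ranked first
worse = ndcg_at_k(["p2", "p1", "p3"], labels)    # irrelevant passage ranked first
```

A reward of this form scores the whole output permutation with a single number, so the ranking that places the irrelevant passage first receives a strictly lower reward than the ideal ordering.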

