Scalable In-context Ranking with Generative Models

Abstract

In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR) that leverages the contextual understanding of LLMs by placing the task description, candidate documents, and the query directly in the model's input prompt and tasking the LLM with identifying the relevant document(s). Although effective, the paradigm faces a significant efficiency challenge: as the candidate list grows, the cost of the attention operation scales quadratically or super-linearly with context length. To address this, this paper first identifies inherent and exploitable structures in the attention of LLMs fine-tuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for true relevant documents during fine-tuning using an auxiliary contrastive training objective, improving retrieval in attention. Experiments on BEIR, MSMarco, and NQ with Mistral-7B demonstrate that FLARE Mistral matches or outperforms existing SOTA listwise rankers and a controlled fine-tuned baseline while being significantly more efficient at inference (4.7x for 100 MSMarco documents in context). It also scales gracefully to long-context shortlists, ranking around 500 documents in context (approximately 100K context length) within a second, which makes it a scalable and effective solution for ICR.
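
The two attention structures named in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the abstract does not specify the exact mask layout, so the code assumes a prompt of the form [instruction prefix][doc 1]...[doc N][query], where document tokens attend only to the prefix and causally within their own block, and query tokens attend to everything before them. The names `doc_spans`, `build_block_mask`, `count_pairs`, and `block_scores` are hypothetical helpers introduced here for illustration.

```python
from typing import List, Tuple


def doc_spans(prefix_len: int, doc_lens: List[int]) -> List[Tuple[int, int]]:
    """Return the (start, end) token span of each document block."""
    spans, pos = [], prefix_len
    for n in doc_lens:
        spans.append((pos, pos + n))
        pos += n
    return spans


def build_block_mask(prefix_len: int, doc_lens: List[int],
                     query_len: int) -> List[List[bool]]:
    """mask[i][j] is True if token i may attend to token j.

    Assumed layout (not taken from the paper): prefix, documents, query.
    """
    spans = doc_spans(prefix_len, doc_lens)
    docs_end = spans[-1][1] if spans else prefix_len
    total = docs_end + query_len
    mask = [[False] * total for _ in range(total)]

    # Prefix tokens: ordinary causal attention among themselves.
    for i in range(prefix_len):
        for j in range(i + 1):
            mask[i][j] = True

    # Document tokens: see the prefix and, causally, their own block only.
    for start, end in spans:
        for i in range(start, end):
            for j in range(prefix_len):
                mask[i][j] = True
            for j in range(start, i + 1):
                mask[i][j] = True

    # Query tokens: see every earlier token, including all documents.
    for i in range(docs_end, total):
        for j in range(i + 1):
            mask[i][j] = True
    return mask


def count_pairs(mask: List[List[bool]]) -> int:
    """Number of (query position, key position) pairs that are computed."""
    return sum(sum(row) for row in mask)


def block_scores(query_attn: List[List[float]],
                 spans: List[Tuple[int, int]]) -> List[float]:
    """Sum the attention mass from query tokens onto each document block.

    query_attn holds one attention row per query token; which layer and
    which query tokens to read from is left open here.
    """
    return [sum(sum(row[s:e]) for row in query_attn) for s, e in spans]


if __name__ == "__main__":
    for n_docs in (2, 4, 8):
        mask = build_block_mask(prefix_len=2, doc_lens=[3] * n_docs, query_len=2)
        total = len(mask)
        full_causal = total * (total + 1) // 2
        print(n_docs, count_pairs(mask), full_causal)
```

Under these assumptions, each additional document of fixed length adds a constant number of attended pairs (its own block, its view of the prefix, and the query's view of it), so the cost grows linearly in the number of documents, whereas full causal attention grows quadratically. `block_scores` shows the second idea in its simplest form: ranking documents by the attention mass they receive from query tokens.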
