Scalable In-context Ranking with Generative Models
October 6, 2025
Authors: Nilesh Gupta, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Inderjit Dhillon, Felix Yu
cs.AI
Abstract
In-context Ranking (ICR) is an emerging paradigm for Information Retrieval
(IR) that leverages the contextual understanding of LLMs by directly
incorporating the task description, candidate documents, and the query into the
model's input prompt and tasking the LLM to identify relevant document(s).
While effective, this paradigm faces a significant efficiency challenge,
especially as the candidate list grows, because the attention operation scales
quadratically (super-linearly) with context length. To this end, this paper first
identifies inherent and exploitable structures in the attention of LLMs
finetuned for ICR: (1) inter-document block sparsity: attention is dense within
each document block but sparse across different documents in the context; and
(2) query-document block relevance: the attention scores from certain query
tokens to a document block in middle layers strongly correlate with that
document's actual relevance. Motivated by these observations, we introduce
BlockRank (Blockwise In-context Ranking), a novel method that adapts the
attention operation in an LLM by (a) architecturally enforcing the observed
inter-document block sparsity, reducing attention complexity from quadratic to
linear without loss in performance, and (b) optimizing query-document block
relevance for the truly relevant documents during fine-tuning via an auxiliary
contrastive training objective, sharpening the retrieval signal in attention. Experiments
on BEIR, MSMarco and NQ with Mistral-7B demonstrate that BlockRank Mistral
matches or outperforms existing SOTA listwise rankers and a controlled
fine-tuned baseline while being significantly more efficient at inference (4.7x
faster for 100 MSMarco documents in context) and scaling gracefully to
long-context shortlists of around 500 in-context documents (approximately 100K
tokens of context) within a second, presenting a scalable and effective
solution for ICR.
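To make the two mechanisms above concrete, here is a minimal PyTorch sketch, not the paper's implementation: build_blockrank_mask enforces the inter-document block sparsity of (a), and contrastive_attention_loss is an InfoNCE-style auxiliary objective over pooled query-to-document attention in the spirit of (b). The function names, the segment-id prompt layout (shared instruction prefix as segment 0), and the mean-pooling choice are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def build_blockrank_mask(segment_ids: torch.Tensor, query_segment: int) -> torch.Tensor:
    """Boolean (seq, seq) attention mask enforcing inter-document block sparsity.

    Each token attends within its own document block; all tokens may read a
    shared instruction prefix (segment 0); query tokens may read every document
    so relevance can be read off query-to-document attention.
    segment_ids: (seq,) integer block id per token, in prompt order.
    """
    same_block = segment_ids[:, None] == segment_ids[None, :]   # dense within a block
    to_shared = segment_ids[None, :] == 0                       # everyone reads the prefix
    from_query = segment_ids[:, None] == query_segment          # query reads everything
    allowed = same_block | to_shared | from_query
    seq = segment_ids.numel()
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool)) # keep autoregressive order
    return allowed & causal

def contrastive_attention_loss(attn: torch.Tensor,
                               segment_ids: torch.Tensor,
                               query_segment: int,
                               gold_segment: int,
                               temperature: float = 1.0) -> torch.Tensor:
    """InfoNCE-style auxiliary loss on one mid-layer attention map.

    attn: (seq, seq) post-softmax attention averaged over heads. Attention
    mass from query tokens is pooled per document block, and the gold
    document's block is treated as the positive class.
    """
    q_rows = attn[segment_ids == query_segment]                  # (n_query_tokens, seq)
    doc_ids = [s for s in segment_ids.unique().tolist()
               if s not in (0, query_segment)]
    block_scores = torch.stack(                                  # (n_docs,) pooled mass
        [q_rows[:, segment_ids == d].sum(-1).mean() for d in doc_ids])
    target = torch.tensor([doc_ids.index(gold_segment)])
    return F.cross_entropy(block_scores[None, :] / temperature, target)

# Example prompt layout: [instruction | doc1 | doc2 | doc3 | query]
# segment_ids = torch.tensor([0]*8 + [1]*40 + [2]*40 + [3]*40 + [4]*12)
# mask = build_blockrank_mask(segment_ids, query_segment=4)
```

Under such a mask, each document block attends only to itself and the fixed-length prefix, so attention cost grows linearly in the number of documents; the auxiliary loss would be added alongside the standard fine-tuning objective.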