ChatPaper.ai

Scalable In-context Ranking with Generative Models

October 6, 2025
Authors: Nilesh Gupta, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Inderjit Dhillon, Felix Yu
cs.AI

Abstract

In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR) that leverages the contextual understanding of LLMs by directly incorporating the task description, candidate documents, and the query into the model's input prompt and tasking the LLM with identifying the relevant document(s). While effective, this paradigm faces a significant efficiency challenge, especially as the candidate list grows, because attention scales quadratically (or super-linearly) with context length. To this end, this paper first identifies inherent, exploitable structure in the attention of LLMs fine-tuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: in middle layers, the attention scores from certain query tokens to a document block strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for the truly relevant documents during fine-tuning with an auxiliary contrastive training objective, improving retrieval within attention. Experiments on BEIR, MSMarco, and NQ with Mistral-7B demonstrate that BlockRank matches or outperforms existing SOTA listwise rankers and a controlled fine-tuned baseline while being significantly more efficient at inference (4.7x faster for 100 MSMarco documents in context) and scaling gracefully to long-context shortlists of around 500 in-context documents (approximately 100K tokens of context) processed within a second, presenting a scalable and effective solution for ICR.
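To make the two structural observations concrete, here is a minimal sketch (not the paper's implementation; the function names, prompt layout, and shapes are assumptions for illustration) of (a) an attention mask that enforces inter-document block sparsity and (b) aggregating a query token's attention mass per document block as a stand-in relevance score. Causal masking is omitted for clarity.

```python
import numpy as np

def blockrank_mask(prefix_len, doc_lens, query_len):
    """Boolean attention mask (True = may attend) for an assumed prompt layout:
    [instruction prefix | doc_1 | ... | doc_n | query].

    - Prefix tokens attend within the prefix.
    - Document tokens attend to the prefix and their own document block only
      (dense within a block, zero across blocks: inter-document block sparsity).
    - Query tokens attend to the full context.
    """
    total = prefix_len + sum(doc_lens) + query_len
    mask = np.zeros((total, total), dtype=bool)

    mask[:prefix_len, :prefix_len] = True  # prefix: dense within itself
    start = prefix_len
    for length in doc_lens:
        end = start + length
        mask[start:end, :prefix_len] = True  # doc tokens see the prefix
        mask[start:end, start:end] = True    # ...and their own block only
        start = end
    mask[start:, :] = True                   # query tokens see everything
    return mask

def score_docs(attn_row, prefix_len, doc_lens):
    """Sum one query token's attention mass over each document block,
    a stand-in for the query-document block relevance signal."""
    scores, start = [], prefix_len
    for length in doc_lens:
        scores.append(float(attn_row[start:start + length].sum()))
        start += length
    return scores
```

Because each row of the mask has at most `prefix_len + max(doc_lens) + query_len` nonzero entries, per-token attention cost no longer grows with the total number of candidate documents, which is the sense in which complexity drops from quadratic to linear in the candidate list.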