Scalable In-context Ranking with Generative Models

Abstract

In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR) that leverages the contextual understanding of LLMs: the task description, the candidate documents, and the query are placed directly in the model's input prompt, and the LLM is tasked with identifying the relevant document(s). Although ICR is effective, efficiency remains a significant challenge, especially as the candidate list grows, because the cost of the attention operation scales quadratically (or at least super-linearly) with context length. To address this, this paper first identifies inherent and exploitable structures in the attention of LLMs fine-tuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, which reduces attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for truly relevant documents during fine-tuning with an auxiliary contrastive training objective, which improves attention-based retrieval. Experiments on BEIR, MSMarco, and NQ with Mistral-7B demonstrate that BlockRank Mistral matches or outperforms existing SOTA listwise rankers and a controlled fine-tuned baseline while being significantly more efficient at inference (4.7x for 100 MSMarco documents in context). It also scales gracefully to long-context shortlists, ranking around 500 in-context documents (approximately 100K context length) within a second, which makes it a scalable and effective solution for ICR.
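The two attention structures described above can be illustrated with a small sketch. This is not the authors' implementation: the exact mask layout (causal attention, document tokens that see only a shared instruction prefix and their own block, query tokens that see everything) and the function names `block_mask` and `doc_scores` are assumptions made for illustration, using only the Python standard library.

```python
from typing import List


def block_mask(instr_len: int, doc_lens: List[int], query_len: int) -> List[List[bool]]:
    """Build a causal attention mask with inter-document block sparsity.

    Assumed layout (hypothetical): [instruction][doc 1]...[doc D][query].
    mask[i][j] is True when token i is allowed to attend to token j.
    """
    # Segment id per token: 0 = instruction, 1..D = documents, D + 1 = query.
    seg: List[int] = [0] * instr_len
    for d, n in enumerate(doc_lens, start=1):
        seg += [d] * n
    query_id = len(doc_lens) + 1
    seg += [query_id] * query_len

    total = len(seg)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: only positions up to and including i
            same_block = seg[i] == seg[j]
            sees_instruction = seg[j] == 0
            is_query = seg[i] == query_id
            if is_query or sees_instruction or same_block:
                mask[i][j] = True
    return mask


def doc_scores(query_attn: List[List[float]], instr_len: int, doc_lens: List[int]) -> List[float]:
    """Score each document by the attention mass it receives from query tokens.

    query_attn holds one row of attention weights per query token, taken from
    some chosen layer (which layer and which tokens to use is an assumption).
    """
    scores: List[float] = []
    start = instr_len
    for n in doc_lens:
        mass = sum(sum(row[start:start + n]) for row in query_attn)
        scores.append(mass / len(query_attn))  # average over query tokens
        start += n
    return scores


def allowed_pairs(mask: List[List[bool]]) -> int:
    """Count the (i, j) pairs that the mask lets through."""
    return sum(sum(row) for row in mask)
```

Under these assumptions, each document token attends to a bounded number of positions (the instruction plus its own block), so for documents of fixed length the number of allowed pairs grows roughly linearly with the number of documents rather than quadratically. Ranking then amounts to sorting documents by the values that `doc_scores` returns.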
