

Benchmarking Information Retrieval Models on Complex Retrieval Tasks

September 8, 2025
Authors: Julian Killingback, Hamed Zamani
cs.AI

Abstract

Large language models (LLMs) are incredibly versatile tools for text-based tasks that have enabled countless previously unimaginable applications. Retrieval models, in contrast, have not yet seen such capable general-purpose models emerge. To achieve this goal, retrieval models must be able to perform complex retrieval tasks, where queries contain multiple parts, constraints, or requirements expressed in natural language. These tasks represent a natural progression from the simple, single-aspect queries used in the vast majority of existing, commonly used evaluation sets. Complex queries arise naturally as people expect search systems to handle more specific and often more ambitious information requests, as demonstrated by how people use LLM-based information systems. Despite the growing desire for retrieval models to expand their capabilities to complex retrieval tasks, resources for assessing retrieval models on a comprehensive set of diverse complex tasks remain limited. The few resources that do exist are narrow in scope and often lack realistic settings, making it hard to gauge the true capabilities of retrieval models on complex real-world retrieval tasks. To address this shortcoming and spur innovation in next-generation retrieval models, we construct a diverse and realistic set of complex retrieval tasks and benchmark a representative set of state-of-the-art retrieval models on them. Additionally, we explore the impact of LLM-based query expansion and rewriting on retrieval quality. Our results show that even the best models struggle to produce high-quality retrieval results: across all tasks, the highest average nDCG@10 is only 0.346 and the highest R@100 only 0.587. Although LLM augmentation can help weaker models, the strongest model's performance decreases across all metrics under every rewriting technique.
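For concreteness, here is a minimal sketch of how the two reported metrics, nDCG@10 and R@100 (recall at rank 100), are conventionally computed for a single query; the benchmark numbers above are averages of these per-query values over all queries and tasks. The document ids, relevance judgments, and ranking below are made-up toy values, not data from the paper, and the DCG uses the common linear-gain form.

```python
import math

def dcg_at_k(gains, k):
    """DCG over the top-k graded relevance gains (linear-gain variant)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranking, qrels, k=10):
    """nDCG@k: DCG of the system ranking normalized by the ideal DCG."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranking]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def recall_at_k(ranking, qrels, k=100):
    """R@k: fraction of all relevant documents found in the top k."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)

# Toy single-query example: graded judgments and a system ranking.
qrels = {"d1": 2, "d4": 1, "d7": 1}
ranking = ["d3", "d1", "d4", "d9", "d7"]
print(ndcg_at_k(ranking, qrels, k=10))     # ≈ 0.686
print(recall_at_k(ranking, qrels, k=100))  # = 1.0
```

The LLM augmentation the abstract refers to typically slots in as a preprocessing step on the query before retrieval. The sketch below shows one plausible shape of rewriting versus expansion; `call_llm` is a hypothetical placeholder for whatever LLM client is available, and the prompts are illustrative, not the paper's exact setup.

```python
from typing import Callable

def rewrite_query(query: str, call_llm: Callable[[str], str]) -> str:
    """Ask the LLM to restate the query with every constraint made explicit."""
    prompt = (
        "Rewrite the following search query so that each requirement and "
        "constraint is stated explicitly. Return only the rewritten query.\n\n"
        f"Query: {query}"
    )
    return call_llm(prompt).strip()

def expand_query(query: str, call_llm: Callable[[str], str]) -> str:
    """Keep the original query and append LLM-generated related terms."""
    prompt = (
        "List search keywords and a short hypothetical answer passage for "
        f"the following query.\n\nQuery: {query}"
    )
    return f"{query} {call_llm(prompt).strip()}"

# Usage with any client mapping a prompt string to a completion, e.g.:
# rewritten = rewrite_query("multi-constraint query with several requirements", my_llm)
```

A practical note on the design: expansion preserves the original query text so exact-match signals survive, while rewriting replaces it entirely, which is one reason rewriting can hurt a retriever that already handles the original phrasing well.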