
Benchmarking Information Retrieval Models on Complex Retrieval Tasks

September 8, 2025
Authors: Julian Killingback, Hamed Zamani
cs.AI

Abstract

Large language models (LLMs) are remarkably versatile tools for text-based tasks and have enabled countless previously unimaginable applications. Retrieval models, in contrast, have not yet seen such capable general-purpose models emerge. To reach this goal, retrieval models must be able to perform complex retrieval tasks, where queries contain multiple parts, constraints, or requirements expressed in natural language. These tasks represent a natural progression from the simple, single-aspect queries used in the vast majority of existing, commonly used evaluation sets. Complex queries arise naturally as people expect search systems to handle more specific and often more ambitious information requests, as demonstrated by how people use LLM-based information systems. Despite the growing desire for retrieval models to expand their capabilities to complex retrieval tasks, few resources exist to assess retrieval models on a comprehensive set of diverse complex tasks. The few that do exist are limited in scope and often lack realistic settings, making it hard to gauge the true capabilities of retrieval models on complex real-world retrieval tasks. To address this shortcoming and spur innovation in next-generation retrieval models, we construct a diverse and realistic set of complex retrieval tasks and benchmark a representative set of state-of-the-art retrieval models on them. Additionally, we explore the impact of LLM-based query expansion and rewriting on retrieval quality. Our results show that even the best models struggle to produce high-quality retrieval results: the highest average nDCG@10 across all tasks is only 0.346 and the highest R@100 only 0.587. Although LLM augmentation can help weaker models, the strongest model's performance decreases across all metrics under every rewriting technique.
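
For readers unfamiliar with the two reported metrics, the sketch below shows one standard way to compute nDCG@10 and R@100 for a single query. It is a minimal illustration in Python, not the paper's evaluation code; the relevance grades and document IDs in the example are invented for demonstration.

    import math

    def dcg_at_k(relevances, k):
        """Discounted cumulative gain over the top-k ranked results."""
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

    def ndcg_at_k(ranked_rels, judged_rels, k=10):
        """nDCG@k: DCG of the system ranking divided by the DCG of an ideal ranking."""
        ideal_dcg = dcg_at_k(sorted(judged_rels, reverse=True), k)
        return dcg_at_k(ranked_rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

    def recall_at_k(retrieved_ids, relevant_ids, k=100):
        """R@k: fraction of relevant documents found in the top-k results."""
        if not relevant_ids:
            return 0.0
        return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

    # Invented example: graded relevance of the top-10 results for one query,
    # plus the grades of all judged relevant documents in the pool.
    ranked = [3, 2, 0, 1, 0, 0, 2, 0, 0, 0]
    pool = [3, 3, 2, 2, 1, 1]
    print(round(ndcg_at_k(ranked, pool, k=10), 3))

Per-query scores like these are averaged over all queries in a task to produce the figures the abstract cites (e.g., an average nDCG@10 of 0.346).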
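The LLM-based query rewriting the abstract mentions can be pictured as a simple preprocessing step: an LLM restates a complex, multi-constraint query before it is passed to the retrieval model. The sketch below illustrates that idea assuming an OpenAI-style chat-completions client; the model name, prompt, and rewrite_query helper are hypothetical and are not taken from the paper.

    # Minimal sketch of LLM-based query rewriting before retrieval.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def rewrite_query(query: str) -> str:
        """Ask an LLM to restate a complex query with every constraint
        and requirement made explicit, then return the rewritten text."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative choice, not the paper's setup
            messages=[
                {"role": "system",
                 "content": "Rewrite the user's search query so that every "
                            "constraint and requirement is stated explicitly."},
                {"role": "user", "content": query},
            ],
        )
        return response.choices[0].message.content

    # The rewritten query would then be fed to the retrieval model in place
    # of (or alongside) the original query.

As the abstract notes, this kind of augmentation helped weaker retrieval models in the authors' experiments but hurt the strongest model across all metrics.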