BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
July 16, 2024
Authors: Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S. Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan O. Arik, Danqi Chen, Tao Yu
cs.AI
Abstract
Existing retrieval benchmarks primarily consist of information-seeking
queries (e.g., aggregated questions from search engines) where keyword or
semantic-based retrieval is usually sufficient. However, many complex
real-world queries require in-depth reasoning to identify relevant documents
that go beyond surface form matching. For example, finding documentation for a
coding question requires understanding the logic and syntax of the functions
involved. To better benchmark retrieval on such challenging queries, we
introduce BRIGHT, the first text retrieval benchmark that requires intensive
reasoning to retrieve relevant documents. BRIGHT is constructed from 1,398
real-world queries collected from diverse domains (such as economics,
psychology, robotics, software engineering, earth sciences, etc.), sourced from
naturally occurring or carefully curated human data. Extensive evaluation
reveals that even state-of-the-art retrieval models perform poorly on BRIGHT.
The leading model on the MTEB leaderboard [38], which achieves a score of 59.0
nDCG@10 there, scores only 18.0 nDCG@10 on BRIGHT. We further demonstrate
that augmenting queries with Chain-of-Thought reasoning generated by large
language models (LLMs) improves performance by up to 12.2 points. Moreover,
BRIGHT is robust against data leakage during pretraining of the benchmarked
models, as we validate by showing similar performance even when documents from
the benchmark are included in the training data. We believe that BRIGHT paves
the way for future research on retrieval systems in more realistic and
challenging settings. Our code and data are available at
https://brightbenchmark.github.io.
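The abstract reports all scores in nDCG@10, the standard graded ranking metric. As a reference for readers, here is a minimal Python sketch of how nDCG@k is computed from relevance labels in ranked order; the function names are illustrative, not from the BRIGHT codebase.

```python
import math

def dcg_at_k(relevances, k=10):
    # DCG@k: sum each document's relevance, discounted by log2(rank + 1).
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    # nDCG@k: DCG of the actual ranking, normalized by the ideal ranking's DCG.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Binary relevance labels for ten retrieved documents, in ranked order:
print(round(ndcg_at_k([1, 0, 0, 1, 0, 0, 0, 0, 0, 0]), 2))  # 0.88
```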
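The reported 12.2-point gain comes from expanding each query with LLM-generated Chain-of-Thought reasoning before retrieval. A minimal sketch of that idea follows; `llm_complete` and `retriever` are hypothetical stand-ins, not the paper's actual interface.

```python
def reasoning_augmented_search(llm_complete, retriever, query, k=10):
    # Ask the LLM to reason step by step about what the query actually needs.
    prompt = (
        "Think step by step about the concepts and documents needed to "
        f"answer this question:\n{query}"
    )
    reasoning = llm_complete(prompt)
    # Retrieve with the reasoning appended, so the retriever can match on the
    # reasoning's vocabulary rather than the query's surface form alone.
    expanded_query = f"{query}\n{reasoning}"
    return retriever.search(expanded_query, top_k=k)
```

The intuition is that the reasoning trace surfaces terminology shared with the relevant documents, which surface-form matching on the raw query misses.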