BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

July 16, 2024
Authors: Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S. Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan O. Arik, Danqi Chen, Tao Yu
cs.AI

Abstract

Existing retrieval benchmarks primarily consist of information-seeking queries (e.g., aggregated questions from search engines) where keyword or semantic-based retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning to identify relevant documents that go beyond surface form matching. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. To better benchmark retrieval on such challenging queries, we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. BRIGHT is constructed from 1,398 real-world queries collected from diverse domains (such as economics, psychology, robotics, software engineering, and earth sciences), sourced from naturally occurring or carefully curated human data. Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard [38], which achieves a score of 59.0 nDCG@10 on MTEB, produces an nDCG@10 score of only 18.0 on BRIGHT. We further demonstrate that augmenting queries with Chain-of-Thought reasoning generated by large language models (LLMs) improves performance by up to 12.2 points. Moreover, BRIGHT is robust against data leakage during pretraining of the benchmarked models, which we validate by showing similar performance even when documents from the benchmark are included in the training data. We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings. Our code and data are available at https://brightbenchmark.github.io.
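The query-augmentation result above lends itself to a brief illustration. The sketch below is a minimal, hypothetical rendering of the idea rather than the paper's actual pipeline: generate_reasoning stands in for any LLM call that produces Chain-of-Thought text about the query, and the sentence-transformers model is an arbitrary off-the-shelf choice for illustration, not one of the retrievers evaluated in BRIGHT.

```python
# Minimal sketch of Chain-of-Thought query augmentation for dense retrieval.
# Hypothetical throughout: generate_reasoning is a placeholder for an LLM call,
# and the embedding model is an illustrative choice, not the paper's setup.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def generate_reasoning(query: str) -> str:
    """Placeholder for an LLM call that writes step-by-step reasoning
    about what a relevant document would need to contain."""
    return f"To answer this, a relevant document should explain: {query}"

def retrieve(query: str, documents: list[str], k: int = 10) -> list[int]:
    # Append the generated reasoning to the query before embedding,
    # so the query representation reflects reasoning, not just keywords.
    augmented = query + "\n" + generate_reasoning(query)
    q_emb = model.encode([augmented], normalize_embeddings=True)
    d_emb = model.encode(documents, normalize_embeddings=True)
    scores = (q_emb @ d_emb.T)[0]            # cosine similarity (normalized)
    return np.argsort(-scores)[:k].tolist()  # indices of the top-k documents
```

In this framing, the augmentation step is the only change to a standard dense-retrieval loop; the reported gains of up to 12.2 nDCG@10 points come from replacing the raw query with the reasoning-expanded one.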
