BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

Abstract

Existing retrieval benchmarks consist primarily of information-seeking queries (e.g., aggregated questions from search engines), for which keyword-based or semantic retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning, beyond surface-form matching, to identify relevant documents. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. To better benchmark retrieval on such challenging queries, we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. BRIGHT is constructed from 1,398 real-world queries spanning diverse domains (such as economics, psychology, robotics, software engineering, and earth sciences), sourced from naturally occurring or carefully curated human data. Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard [38], which achieves an nDCG@10 of 59.0 there, scores only 18.0 nDCG@10 on BRIGHT. We further demonstrate that augmenting queries with Chain-of-Thought reasoning generated by large language models (LLMs) improves performance by up to 12.2 points. Moreover, BRIGHT is robust against data leakage during pretraining of the benchmarked models, as we validate by showing similar performance even when documents from the benchmark are included in the training data. We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings. Our code and data are available at https://brightbenchmark.github.io.
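For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how nDCG@10 can be computed for a single query. It is not taken from the BRIGHT codebase: the function names, the linear-gain formulation, and the toy document IDs are all assumptions made for illustration, and the value it returns is on a 0 to 1 scale, whereas the scores in the abstract appear to be reported on a 0 to 100 scale.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k gains (linear gain, log2 discount)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query.

    ranked_doc_ids: document IDs in the order returned by a retriever (hypothetical IDs).
    relevance: gold relevance label per document ID; unlisted documents count as 0.
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant documents first.
    ideal_gains = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    # Toy example: two relevant documents, one retrieved at rank 1 and one at rank 3.
    ranking = ["d1", "d2", "d3"]
    gold = {"d1": 1.0, "d3": 1.0}
    print(f"nDCG@10 = {ndcg_at_k(ranking, gold):.4f}")
```

A benchmark-level score would typically be the mean of this per-query value over all queries; whether BRIGHT uses exactly this gain and discount formulation is not stated in the abstract.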
