BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

Abstract

Existing retrieval benchmarks primarily consist of information-seeking queries (e.g., aggregated questions from search engines) for which keyword or semantic-based retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning to identify relevant documents, going beyond surface-form matching. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. To better benchmark retrieval on such challenging queries, we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. BRIGHT is constructed from 1,398 real-world queries collected from diverse domains, such as economics, psychology, robotics, software engineering, and earth sciences, and sourced from naturally occurring or carefully curated human data. Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard [38], which achieves a score of 59.0 nDCG@10 there, scores only 18.0 nDCG@10 on BRIGHT. We further demonstrate that augmenting queries with Chain-of-Thought reasoning generated by large language models (LLMs) improves performance by up to 12.2 points. Moreover, BRIGHT is robust against data leakage during pretraining of the benchmarked models: we validate this by showing similar performance even when documents from the benchmark are included in the training data. We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings. Our code and data are available at https://brightbenchmark.github.io.
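For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It is not taken from the BRIGHT codebase: the function names `dcg_at_k` and `ndcg_at_k`, the `gold` relevance mapping, and the linear-gain formulation are assumptions made for illustration. The reported scores (59.0, 18.0) appear to be on a 0-100 scale, whereas this sketch returns a value in [0, 1].

```python
import math
from typing import Mapping, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain of the first k relevance grades (linear gain)."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids: Sequence[str],
              gold: Mapping[str, float],
              k: int = 10) -> float:
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    gold: hypothetical mapping from relevant document id to relevance grade;
          documents not listed are treated as non-relevant (grade 0).
    """
    # Gains of the retrieved documents, in retrieved order.
    gains = [gold.get(doc_id, 0.0) for doc_id in ranked_doc_ids]
    # The ideal ranking places all gold documents first, highest grade first.
    ideal = sorted(gold.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # No relevant documents: define the score as 0.
    return dcg_at_k(gains, k) / ideal_dcg


if __name__ == "__main__":
    gold = {"d1": 1.0, "d2": 1.0}
    print(ndcg_at_k(["d3", "d1", "d4"], gold))  # one relevant doc at rank 2
```

A benchmark-level score is then typically the mean of the per-query values, scaled by 100.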
