IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval

Abstract

We introduce IFIR, the first comprehensive benchmark designed to evaluate instruction-following information retrieval (IR) in expert domains. IFIR includes 2,426 high-quality examples and covers eight subsets across four specialized domains: finance, law, healthcare, and science literature. Each subset addresses one or more domain-specific retrieval tasks, replicating real-world scenarios where customized instructions are critical. IFIR enables a detailed analysis of instruction-following retrieval capabilities by incorporating instructions at different levels of complexity. We also propose a novel LLM-based evaluation method to provide a more precise and reliable assessment of model performance in following instructions. Through extensive experiments on 15 frontier retrieval models, including those based on LLMs, our results reveal that current models face significant challenges in effectively following complex, domain-specific instructions. We further provide in-depth analyses to highlight these limitations, offering valuable insights to guide future advancements in retriever development.
