
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval

March 6, 2025
Authors: Tingyu Song, Guo Gan, Mingsheng Shang, Yilun Zhao
cs.AI

Abstract
We introduce IFIR, the first comprehensive benchmark designed to evaluate instruction-following information retrieval (IR) in expert domains. IFIR includes 2,426 high-quality examples and covers eight subsets across four specialized domains: finance, law, healthcare, and science literature. Each subset addresses one or more domain-specific retrieval tasks, replicating real-world scenarios where customized instructions are critical. IFIR enables a detailed analysis of instruction-following retrieval capabilities by incorporating instructions at different levels of complexity. We also propose a novel LLM-based evaluation method to provide a more precise and reliable assessment of model performance in following instructions. Through extensive experiments on 15 frontier retrieval models, including those based on LLMs, our results reveal that current models face significant challenges in effectively following complex, domain-specific instructions. We further provide in-depth analyses to highlight these limitations, offering valuable insights to guide future advancements in retriever development.
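The abstract mentions an LLM-based evaluation method for measuring how well retrievers follow instructions, but does not spell out its mechanics. The sketch below illustrates one plausible shape such an evaluation could take: a precision-at-k-style score in which a judge (standing in for an LLM call) checks each top-ranked passage against the query's instruction. All names here (`instruction_following_score`, `keyword_judge`) are hypothetical illustrations, not the paper's actual implementation.

```python
from typing import Callable, List

def instruction_following_score(
    instruction: str,
    retrieved: List[str],
    judge: Callable[[str, str], bool],
    k: int = 10,
) -> float:
    """Fraction of the top-k retrieved passages that the judge deems
    consistent with the instruction (a precision@k-style score)."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for passage in top_k if judge(instruction, passage))
    return hits / len(top_k)

# Toy judge standing in for an LLM call: it merely checks for a
# required keyword, purely for illustration.
def keyword_judge(instruction: str, passage: str) -> bool:
    return "randomized trial" in passage.lower()

docs = [
    "A randomized trial of drug X in adults with hypertension.",
    "An observational cohort study of drug X in older adults.",
]
score = instruction_following_score(
    "Retrieve only randomized trials on drug X.", docs, keyword_judge, k=2
)
print(score)  # 0.5
```

In a real setup the judge would prompt an LLM to decide whether each passage satisfies the instruction's constraints, which is what makes such scoring more precise than relevance labels alone.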

