FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

Abstract

Modern Large Language Models (LLMs) are capable of following long and complex instructions that enable a diverse range of user tasks. However, although Information Retrieval (IR) models use LLMs as the backbone of their architectures, nearly all of them still take only queries as input, with no instructions. For the handful of recent models that do take instructions, it is unclear how they use them. We introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR builds on the long history of the TREC conferences: just as TREC provides human annotators with instructions (also known as narratives) to determine document relevance, IR models should be able to understand and decide relevance based on these detailed instructions. Our evaluation benchmark starts with three deeply judged TREC collections and alters the annotator instructions, re-annotating relevant documents. This process lets us measure how well IR models follow instructions using a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to use instructions correctly: they treat them as a source of basic keywords and struggle to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model improves significantly (by over 13%) after fine-tuning on our training set.
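The pairwise evaluation idea can be illustrated with a short sketch. It assumes that, for each document that becomes non-relevant once the instruction is altered, we know its 1-based rank under the original instruction and under the altered one. The scoring rule compares reciprocal ranks; the function names and the exact formula are assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Iterable, Tuple


def pairwise_rank_shift(rank_original: int, rank_altered: int) -> float:
    """Score one document in [-1, 1] (hypothetical helper, not from the abstract).

    The document is assumed to have become non-relevant under the altered
    instruction. A positive score means the model ranked it lower (followed
    the instruction); a negative score means it ranked it higher.
    """
    if rank_original < 1 or rank_altered < 1:
        raise ValueError("ranks must be 1-based positive integers")
    rr_original = 1.0 / rank_original  # reciprocal rank, original instruction
    rr_altered = 1.0 / rank_altered    # reciprocal rank, altered instruction
    if rank_original > rank_altered:
        # The document moved up the ranking: negative score.
        return rr_original / rr_altered - 1.0
    # The document moved down or stayed put: non-negative score.
    return 1.0 - rr_altered / rr_original


def mean_pairwise_score(rank_pairs: Iterable[Tuple[int, int]]) -> float:
    """Average the per-document scores over (rank_original, rank_altered) pairs."""
    scores = [pairwise_rank_shift(o, a) for o, a in rank_pairs]
    if not scores:
        raise ValueError("at least one rank pair is required")
    return sum(scores) / len(scores)
```

Under this sketch, a model that ignores the instruction change leaves ranks unchanged and scores 0, while a model that demotes newly non-relevant documents scores above 0.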
