Judging the Judges: A Collection of LLM-Generated Relevance Judgements

Abstract

Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. In particular, LLMs could allow IR experimenters to build evaluation collections with a fraction of the manual labor currently required. This could help with fresh topics on which knowledge is still limited, and could ease the evaluation of ranking systems in low-resource scenarios, where human annotators are hard to find. Given the fast pace of recent developments in the domain, many questions concerning LLMs as assessors remain unanswered. Among the aspects that require further investigation is the impact of the various components of a relevance judgment generation pipeline, such as the prompt used or the LLM chosen. This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed. Specifically, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments, produced by eight international teams who participated in the challenge. Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques. The released resource is available at the following link: https://llm4eval.github.io/LLMJudge-benchmark/
