Judging the Judges: A Collection of LLM-Generated Relevance Judgements
February 19, 2025
Authors: Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, Emine Yilmaz
cs.AI
Abstract
Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required. This could help with fresh topics on which there is still limited knowledge and could mitigate the challenges of evaluating ranking systems in low-resource scenarios, where human annotators are hard to find. Given the fast-paced recent developments in the domain, many questions concerning LLMs as assessors are yet to be answered. Among the aspects that require further investigation is the impact of the various components in a relevance judgment generation pipeline, such as the prompt used or the LLM chosen.
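To make those pipeline components concrete, here is a minimal sketch of a single LLM-based judgment step, assuming a 0-3 grading scale as in the TREC Deep Learning tracks; the prompt wording and the injected llm_call function are illustrative assumptions, not the setup used by any challenge participant.

```python
# Minimal sketch of one step in a relevance judgment generation pipeline.
# The prompt wording, the 0-3 grading scale, and the injected `llm_call`
# function are illustrative assumptions; they are exactly the kind of
# components whose impact the paper argues needs further study.

PROMPT_TEMPLATE = """You are a relevance assessor.
Query: {query}
Passage: {passage}
Grade how well the passage answers the query on a 0-3 scale
(0 = irrelevant, 3 = perfectly relevant). Reply with a single digit."""


def judge_relevance(llm_call, query: str, passage: str) -> int:
    """Ask an LLM for a graded relevance label and parse the reply."""
    reply = llm_call(PROMPT_TEMPLATE.format(query=query, passage=passage))
    digits = [ch for ch in reply if ch.isdigit()]
    # Fall back to 0 if the model does not return a parsable grade.
    return min(int(digits[0]), 3) if digits else 0
```

Swapping the prompt template or the model behind llm_call changes the labels produced, which is the kind of variation the released judgments make it possible to study.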
This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed. Specifically, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments, produced by the eight international teams who participated in the challenge. Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs, but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques. The released resource is available at https://llm4eval.github.io/LLMJudge-benchmark/
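As a hedged usage sketch (not part of the released resource itself), the snippet below shows one way to compare a single set of LLM-generated labels against the human judgments, here via Cohen's kappa; the file names and the assumption that both files follow the standard TREC qrels format are illustrative, not a documented interface of the benchmark.

```python
# Sketch: agreement between one set of LLM-generated labels and the human
# qrels, assuming both files follow the standard TREC qrels format:
#   topic_id  iteration  doc_id  grade
# The file names below are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score


def load_qrels(path: str) -> dict:
    labels = {}
    with open(path) as f:
        for line in f:
            topic, _, doc, grade = line.split()
            labels[(topic, doc)] = int(grade)
    return labels


human = load_qrels("qrels.dl2023.txt")   # hypothetical path to human qrels
llm = load_qrels("team_run.llm.qrels")   # hypothetical path to one LLM label set

shared = sorted(set(human) & set(llm))
kappa = cohen_kappa_score([human[k] for k in shared],
                          [llm[k] for k in shared])
print(f"Cohen's kappa over {len(shared)} shared query-passage pairs: {kappa:.3f}")
```

Majority-voting several such label sets before computing agreement is one simple way to probe the ensemble question raised in the abstract.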