Judging the Judges: A Collection of LLM-Generated Relevance Judgements
February 19, 2025
Authors: Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, Emine Yilmaz
cs.AI
Abstract
Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required. This could help with fresh topics on which there is still limited knowledge and could mitigate the challenges of evaluating ranking systems in low-resource scenarios, where human annotators are hard to find. Given the fast-paced recent developments in the domain, many questions concerning LLMs as assessors are yet to be answered. Among the aspects that require further investigation is the impact of the various components in a relevance judgment generation pipeline, such as the prompt used or the LLM chosen.
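To make those pipeline components concrete, here is a minimal sketch of a single LLM-based judgment step, assuming a 0-3 grading scale as in the TREC Deep Learning tracks; the prompt wording and the injected llm_call function are illustrative assumptions, not the setup used by any challenge participant.

```python
# Minimal sketch of one step in a relevance judgment generation pipeline.
# The prompt wording, the 0-3 grading scale, and the injected `llm_call`
# function are illustrative assumptions; they are exactly the kind of
# components whose impact the paper argues needs further study.

PROMPT_TEMPLATE = """You are a relevance assessor.
Query: {query}
Passage: {passage}
Grade how well the passage answers the query on a 0-3 scale
(0 = irrelevant, 3 = perfectly relevant). Reply with a single digit."""


def judge_relevance(llm_call, query: str, passage: str) -> int:
    """Ask an LLM for a graded relevance label and parse the reply."""
    reply = llm_call(PROMPT_TEMPLATE.format(query=query, passage=passage))
    digits = [ch for ch in reply if ch.isdigit()]
    # Fall back to 0 if the model does not return a parsable grade.
    return min(int(digits[0]), 3) if digits else 0
```

Swapping the prompt template or the model behind llm_call changes the labels produced, which is the kind of variation the released judgments make it possible to study.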
This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed. Specifically, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments, produced by the eight international teams who participated in the challenge. Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs, but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques. The released resource is available at https://llm4eval.github.io/LLMJudge-benchmark/
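As a hedged usage sketch (not part of the released resource itself), the snippet below shows one way to compare a single set of LLM-generated labels against the human judgments, here via Cohen's kappa; the file names and the assumption that both files follow the standard TREC qrels format are illustrative, not a documented interface of the benchmark.

```python
# Sketch: agreement between one set of LLM-generated labels and the human
# qrels, assuming both files follow the standard TREC qrels format:
#   topic_id  iteration  doc_id  grade
# The file names below are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score


def load_qrels(path: str) -> dict:
    labels = {}
    with open(path) as f:
        for line in f:
            topic, _, doc, grade = line.split()
            labels[(topic, doc)] = int(grade)
    return labels


human = load_qrels("qrels.dl2023.txt")   # hypothetical path to human qrels
llm = load_qrels("team_run.llm.qrels")   # hypothetical path to one LLM label set

shared = sorted(set(human) & set(llm))
kappa = cohen_kappa_score([human[k] for k in shared],
                          [llm[k] for k in shared])
print(f"Cohen's kappa over {len(shared)} shared query-passage pairs: {kappa:.3f}")
```

Majority-voting several such label sets before computing agreement is one simple way to probe the ensemble question raised in the abstract.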