

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

March 16, 2025
作者: Kanzhi Cheng, Wenpo Song, Jiaxin Fan, Zheng Ma, Qiushi Sun, Fangzhi Xu, Chenyang Yan, Nuo Chen, Jianbing Zhang, Jiajun Chen
cs.AI

Abstract

Image captioning has been a longstanding challenge in vision-language research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate detailed and comprehensive image descriptions. However, benchmarking the quality of such captions remains unresolved. This paper addresses two key questions: (1) How well do current VLMs actually perform on image captioning, particularly compared to humans? We built CapArena, a platform with over 6000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance, while most open-source models lag behind. (2) Can automated metrics reliably assess detailed caption quality? Using human annotations from CapArena, we evaluate traditional and recent captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while some metrics (e.g., METEOR) show decent caption-level agreement with humans, their systematic biases lead to inconsistencies in model ranking. In contrast, VLM-as-a-Judge demonstrates robust discernment at both the caption and model levels. Building on these insights, we release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 94.3% correlation with human rankings at just $4 per test. Data and resources will be open-sourced at https://caparena.github.io.
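The abstract describes an arena-style evaluation: models are ranked from pairwise caption battles, and an automated benchmark is validated by its correlation with the human-derived ranking. As a rough illustration of that pipeline (not the paper's actual implementation; the model names, the Elo update rule, and the hand-computed Spearman correlation below are all illustrative assumptions), one could aggregate battle outcomes into ratings and then compare two rankings like this:

```python
# Illustrative sketch only: aggregate pairwise caption battles into Elo
# ratings, then measure agreement between an automated ranking and a
# human ranking with Spearman correlation (the kind of rank-correlation
# statistic behind a figure like "94.3% correlation with human rankings").
from collections import defaultdict

def elo_ratings(battles, k=32, base=1500.0):
    """battles: list of (winner, loser) model-name pairs."""
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        # Expected score of the winner under the standard Elo formula.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return dict(ratings)

def spearman(rank_a, rank_b):
    """Spearman rho between two rankings of the same items (no ties)."""
    n = len(rank_a)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    d2 = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical battle log (winner, loser); "human" is treated as one contestant,
# mirroring the paper's human-vs-model comparisons.
battles = [("GPT-4o", "open-A"), ("GPT-4o", "open-B"),
           ("human", "open-A"), ("GPT-4o", "human"), ("open-A", "open-B")]
ratings = elo_ratings(battles)
auto_rank = sorted(ratings, key=ratings.get, reverse=True)
human_rank = ["GPT-4o", "human", "open-A", "open-B"]
print(auto_rank, spearman(auto_rank, human_rank))
```

A real arena typically fits a Bradley-Terry model rather than sequential Elo updates (which are order-dependent), but both reduce a battle log to a single scalar rating per model.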


PDF (252) · March 19, 2025