CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
March 16, 2025
Authors: Kanzhi Cheng, Wenpo Song, Jiaxin Fan, Zheng Ma, Qiushi Sun, Fangzhi Xu, Chenyang Yan, Nuo Chen, Jianbing Zhang, Jiajun Chen
cs.AI
Abstract
Image captioning has been a longstanding challenge in vision-language
research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate
detailed and comprehensive image descriptions. However, benchmarking the
quality of such captions remains unresolved. This paper addresses two key
questions: (1) How well do current VLMs actually perform on image captioning,
particularly compared to humans? We built CapArena, a platform with over 6000
pairwise caption battles and high-quality human preference votes. Our
arena-style evaluation marks a milestone, showing that leading models like
GPT-4o achieve or even surpass human performance, while most open-source models
lag behind. (2) Can automated metrics reliably assess detailed caption quality?
Using human annotations from CapArena, we evaluate traditional and recent
captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while
some metrics (e.g., METEOR) show decent caption-level agreement with humans,
their systematic biases lead to inconsistencies in model ranking. In contrast,
VLM-as-a-Judge demonstrates robust discernment at both the caption and model
levels. Building on these insights, we release CapArena-Auto, an accurate and
efficient automated benchmark for detailed captioning, achieving 94.3%
correlation with human rankings at just $4 per test. Data and resources will be
open-sourced at https://caparena.github.io.
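The arena-style evaluation described above aggregates pairwise caption battles into a model ranking. As a minimal sketch of how such pairwise votes can be turned into a leaderboard, the snippet below ranks models by win rate, counting a tie as half a win for each side. This is an illustrative aggregation only, not CapArena's actual scoring code, and the model names are hypothetical:

```python
from collections import defaultdict

def rank_by_win_rate(battles):
    """Rank models from pairwise battle outcomes.

    battles: list of (model_a, model_b, winner) tuples, where
    winner is "a", "b", or "tie".
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for a, b, winner in battles:
        games[a] += 1
        games[b] += 1
        if winner == "a":
            wins[a] += 1
        elif winner == "b":
            wins[b] += 1
        else:  # tie: half a win for each side
            wins[a] += 0.5
            wins[b] += 0.5
    rates = {m: wins[m] / games[m] for m in games}
    # Sort models by descending win rate.
    return sorted(rates, key=rates.get, reverse=True)

# Hypothetical battles: one strong model beats two others, which tie.
battles = [
    ("model-strong", "model-x", "a"),
    ("model-strong", "model-y", "a"),
    ("model-x", "model-y", "tie"),
]
print(rank_by_win_rate(battles))  # model-strong ranked first
```

In practice, arena leaderboards often fit a Bradley-Terry or Elo model over the same votes rather than raw win rates, which corrects for unbalanced matchups; win rate is shown here only because it is the simplest aggregation.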