CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

Abstract

Image captioning has been a longstanding challenge in vision-language research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate detailed and comprehensive image descriptions. However, benchmarking the quality of such captions remains unresolved. This paper addresses two key questions: (1) How well do current VLMs actually perform on image captioning, particularly compared to humans? We built CapArena, a platform with over 6000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance, while most open-source models lag behind. (2) Can automated metrics reliably assess detailed caption quality? Using human annotations from CapArena, we evaluate traditional and recent captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while some metrics (e.g., METEOR) show decent caption-level agreement with humans, their systematic biases lead to inconsistencies in model ranking. In contrast, VLM-as-a-Judge demonstrates robust discernment at both the caption and model levels. Building on these insights, we release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 94.3% correlation with human rankings at just $4 per test. Data and resources will be open-sourced at https://caparena.github.io.
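To make the two quantities in the abstract concrete, the following is a minimal sketch of how pairwise battle votes might be turned into a model ranking, and how an automated ranking might be compared with the human one. It rests on assumptions not stated in the abstract: the ranking here uses a plain win rate (with a tie counted as half a win) as a simplification, the correlation is computed as Spearman's rank correlation, and the model names, votes, and automated scores are hypothetical toy values. The abstract does not specify which ranking method or correlation coefficient CapArena actually uses.

```python
from collections import defaultdict


def win_rates(battles):
    """Compute per-model win rates from pairwise battles.

    battles: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}.
    A tie counts as half a win for each side (an assumption of this sketch).
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in battles:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "a":
            wins[model_a] += 1.0
        elif winner == "b":
            wins[model_b] += 1.0
        else:
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return {model: wins[model] / games[model] for model in games}


def spearman(scores_x, scores_y):
    """Spearman rank correlation of two score dicts over the same models.

    Uses the closed-form formula, which assumes no tied scores.
    """
    models = sorted(scores_x)

    def ranks(scores):
        ordered = sorted(models, key=lambda m: scores[m], reverse=True)
        return {m: position + 1 for position, m in enumerate(ordered)}

    rank_x, rank_y = ranks(scores_x), ranks(scores_y)
    n = len(models)
    squared_diffs = sum((rank_x[m] - rank_y[m]) ** 2 for m in models)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Hypothetical toy votes; "human" stands for human-written captions.
battles = [
    ("model_x", "human", "a"),
    ("model_x", "model_y", "a"),
    ("human", "model_y", "a"),
    ("model_y", "human", "tie"),
]
human_based = win_rates(battles)

# Hypothetical scores from some automated metric for the same three entries.
automated = {"model_x": 0.9, "human": 0.4, "model_y": 0.6}

print(human_based)
print(spearman(human_based, automated))
```

In this toy case the automated metric swaps the order of the two lower-ranked entries, so the rank correlation falls below 1 even though the top entry agrees. This is the model-level kind of disagreement the abstract describes for some metrics.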
