CaptionQA: Is Your Caption as Useful as the Image Itself?

November 26, 2025
Authors: Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, Chenfeng Xu
cs.AI

Abstract

Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well the caption supports downstream tasks. CaptionQA is an extensible, domain-dependent benchmark covering four domains (Natural, Document, E-commerce, and Embodied AI), each with fine-grained taxonomies (25 top-level categories and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA contains 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and can be used by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the utility of an image and that of its caption. Notably, models that are nearly identical on traditional image-QA benchmarks drop by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
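
The evaluation protocol described in the abstract (a text-only model answers multiple-choice questions from the caption alone, and accuracy is taken as caption utility) can be sketched roughly as follows. This is a minimal illustration and not the released CaptionQA pipeline: the `Question` fields, the `Answerer` signature, the `caption_utility` function, and the toy `keyword_answerer` are all hypothetical names chosen for this example, and a real run would presumably replace the toy answerer with a call to an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class Question:
    """One multiple-choice question tied to an image (hypothetical schema)."""
    image_id: str
    domain: str
    prompt: str
    choices: Sequence[str]
    answer_index: int


# Hypothetical answerer signature: (caption, prompt, choices) -> chosen index.
# The answerer never sees the image, only the caption text.
Answerer = Callable[[str, str, Sequence[str]], int]


def caption_utility(
    questions: Sequence[Question],
    captions: Dict[str, str],
    answer: Answerer,
) -> Dict[str, float]:
    """Return per-domain accuracy of a caption-only answerer."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for q in questions:
        caption = captions.get(q.image_id, "")
        predicted = answer(caption, q.prompt, q.choices)
        total[q.domain] = total.get(q.domain, 0) + 1
        if predicted == q.answer_index:
            correct[q.domain] = correct.get(q.domain, 0) + 1
    return {domain: correct.get(domain, 0) / n for domain, n in total.items()}


def keyword_answerer(caption: str, prompt: str, choices: Sequence[str]) -> int:
    """Toy stand-in for an LLM: pick the first choice mentioned in the caption."""
    text = caption.lower()
    for index, choice in enumerate(choices):
        if choice.lower() in text:
            return index
    return -1  # no choice is supported by the caption; counted as wrong


# Made-up example data, for illustration only.
questions = [
    Question("img1", "Natural", "What color is the car?", ["red", "blue", "green"], 0),
    Question("img1", "Natural", "How many people are visible?", ["one", "two", "three"], 1),
    Question("img2", "Document", "What type of chart is shown?", ["bar", "pie", "line"], 1),
]
captions = {
    "img1": "A red car parked on a street.",
    "img2": "A pie chart of market share.",
}

scores = caption_utility(questions, captions, keyword_answerer)
print(scores)
```

In this toy data the caption for `img1` mentions the car's color but not the number of people, so the second question cannot be answered from the caption. That is the kind of information loss a utility-based score is meant to surface.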