
CaptionQA: Is Your Caption as Useful as the Image Itself?

November 26, 2025
Authors: Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, Chenfeng Xu
cs.AI

Abstract

Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: can captions stand in for images in real downstream tasks? We propose CaptionQA, a utility-based benchmark for model-generated captions in which caption quality is measured by how well the caption supports downstream tasks. CaptionQA is an extensible, domain-dependent benchmark covering four domains (Natural, Document, E-commerce, and Embodied AI), each with fine-grained taxonomies (25 top-level categories and 69 subcategories) that identify the information useful for domain-specific tasks. CaptionQA comprises 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, which directly measures whether captions preserve image-level utility and can be used by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the utility of an image and that of its caption. Notably, models that perform nearly identically on traditional image-QA benchmarks drop by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extending it to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
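
The caption-only protocol described above can be illustrated with a minimal sketch. This is not the authors' released code: the `Question` type, the `caption_utility` function, and the toy `keyword_answerer` (a stand-in for the downstream LLM) are all hypothetical names introduced here, and utility is assumed to be plain multiple-choice accuracy over the questions attached to one image.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Question:
    """One multiple-choice question about an image (hypothetical schema)."""
    text: str
    choices: Sequence[str]
    answer_index: int  # index of the correct choice


# Hypothetical interface: the answerer sees only the caption and the
# question, never the image, and returns the index of its chosen option
# (or -1 to abstain).
Answerer = Callable[[str, Question], int]


def caption_utility(caption: str, questions: Sequence[Question],
                    answer: Answerer) -> float:
    """Fraction of questions answered correctly from the caption alone."""
    if not questions:
        raise ValueError("at least one question is required")
    correct = sum(1 for q in questions if answer(caption, q) == q.answer_index)
    return correct / len(questions)


def keyword_answerer(caption: str, question: Question) -> int:
    """Toy stand-in for an LLM: pick the first choice named in the caption."""
    lowered = caption.lower()
    for i, choice in enumerate(question.choices):
        if choice.lower() in lowered:
            return i
    return -1  # abstain when the caption mentions none of the choices


questions = [
    Question("What color is the mug?", ["red", "blue", "green"], 1),
    Question("What is next to the mug?", ["laptop", "book", "plate"], 0),
]

terse_score = caption_utility("A mug on a table.", questions, keyword_answerer)
detailed_score = caption_utility(
    "A blue mug next to a laptop on a wooden table.", questions, keyword_answerer
)
```

In this toy setup the terse caption supports none of the questions while the detailed one supports both, which is the kind of gap the benchmark is designed to expose. A real evaluation would replace `keyword_answerer` with a call to an actual LLM.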