Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

Abstract

Image captioning has long been a pivotal task in visual understanding, and recent advances in vision-language models (VLMs) have significantly improved the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored because of outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, designed specifically for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing each unit individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. DeCapBench also correlates highly with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present FeedQuill, an automatic fine-grained feedback collection method for preference optimization that builds on our metric and generalizes robustly across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also improves performance across various benchmarks, achieving superior detailed captioning performance and surpassing GPT-4o.
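To make the idea of unit-level scoring concrete, here is a minimal sketch of how per-unit judgments could be aggregated into a caption score. It is an illustration under stated assumptions, not the DCScore definition: the abstract does not give a formula, so the precision, recall, and F1 combination below is assumed, and the function name `score_units` and its inputs are hypothetical. In practice the per-unit verdicts would come from a human or model judge; here they are hard-coded.

```python
def score_units(candidate_verdicts, reference_covered):
    """Aggregate per-unit judgments for one caption (illustrative only).

    candidate_verdicts: one bool per primitive information unit in the
        generated caption (True = judged consistent with the image).
    reference_covered: one bool per unit in a reference description
        (True = the generated caption mentions that unit).
    """
    # Share of generated units that are not hallucinated.
    precision = (
        sum(candidate_verdicts) / len(candidate_verdicts)
        if candidate_verdicts else 0.0
    )
    # Share of reference units the caption covers (comprehensiveness).
    recall = (
        sum(reference_covered) / len(reference_covered)
        if reference_covered else 0.0
    )
    # Harmonic mean; an assumed way to combine the two aspects.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: 4 generated units, 1 hallucinated;
# 6 reference units, 3 of them covered by the caption.
scores = score_units(
    [True, True, False, True],
    [True, True, True, False, False, False],
)
print(scores)
```

Scoring at the unit level means one hallucinated detail lowers precision without discarding the credit for the correct details around it, which a single caption-level judgment cannot express.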
