Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning
March 10, 2025
Authors: Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan
cs.AI
Abstract
Image captioning has long been a pivotal task in visual understanding, with
recent advancements in vision-language models (VLMs) significantly enhancing
the ability to generate detailed image captions. However, the evaluation of
detailed image captioning remains underexplored due to outdated evaluation
metrics and coarse annotations. In this paper, we introduce DeCapBench along
with a novel metric, DCScore, specifically designed for detailed captioning
tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by
deconstructing responses into the smallest self-sufficient units, termed
primitive information units, and assessing them individually. Our evaluation
shows that DCScore aligns more closely with human judgment than other
rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high
correlation with VLM arena results on descriptive tasks, surpassing existing
benchmarks for vision-language models. Additionally, we present an automatic
fine-grained feedback collection method, FeedQuill, for preference optimization
based on our advanced metric, showing robust generalization capabilities across
auto-generated preference data. Extensive experiments on multiple VLMs
demonstrate that our method not only significantly reduces hallucinations but
also enhances performance across various benchmarks, achieving superior detail
captioning performance while surpassing GPT-4o.
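
The abstract describes DCScore only at a high level: a caption is split into primitive information units, and each unit is judged individually for hallucination and coverage. The sketch below is one plausible reading of that idea, not the paper's actual formula. It assumes the units have already been extracted as normalized strings, uses exact string matching as a stand-in for the per-unit judgment, and combines the two aspects with a harmonic mean. The names `UnitScore` and `score_units` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class UnitScore:
    precision: float  # share of caption units supported by the reference (non-hallucinated)
    recall: float     # share of reference units the caption covers (comprehensiveness)
    f1: float         # harmonic mean of the two (an assumption, not from the abstract)


def score_units(caption_units: Iterable[str], reference_units: Iterable[str]) -> UnitScore:
    """Score a caption from its primitive information units.

    Assumption: exact string equality stands in for the per-unit
    correctness check, which in practice would need a judge model.
    """
    caption = set(caption_units)
    reference = set(reference_units)
    supported = caption & reference  # units present in both sets

    precision = len(supported) / len(caption) if caption else 0.0
    recall = len(supported) / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return UnitScore(precision, recall, f1)


if __name__ == "__main__":
    # Toy data: "a cat" is unsupported (a hallucination),
    # while "green grass" and "a fence" are missed details.
    result = score_units(
        ["a dog", "a red ball", "a cat"],
        ["a dog", "a red ball", "green grass", "a fence"],
    )
    print(result)
```

Scoring units rather than whole captions lets a single number penalize both invented details (lower precision) and omitted ones (lower recall), which is the trade-off the abstract attributes to the metric.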