ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
June 24, 2025
Authors: Long Xing, Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jinsong Li, Shuangrui Ding, Weiming Zhang, Nenghai Yu, Jiaqi Wang, Feng Wu, Dahua Lin
cs.AI
Abstract
This paper presents ScaleCap, an inference-time scalable image captioning
strategy that generates comprehensive and detailed image captions. The key
challenges of high-quality image captioning lie in the inherent biases of
LVLMs: multimodal bias resulting in imbalanced descriptive granularity,
offering detailed accounts of some elements while merely skimming over others;
linguistic bias leading to hallucinated descriptions of non-existent objects.
To address these issues, we propose a scalable debiased captioning strategy,
which continuously enriches and calibrates the caption with increased inference
budget. Specifically, we propose two novel components: heuristic question
answering and contrastive sentence rating. The former generates
content-specific questions based on the image and answers them to progressively
inject relevant information into the caption. The latter employs sentence-level
offline contrastive decoding to effectively identify and eliminate
hallucinations caused by linguistic biases. With increased inference cost,
ScaleCap raises more heuristic questions to progressively capture additional
visual details, generating captions that are more accurate, balanced, and
informative. Extensive modality alignment experiments demonstrate the
effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them
for LVLM pretraining leads to consistent performance gains across 11 widely
used benchmarks. Furthermore, ScaleCap demonstrates the richness and fidelity
of its generated captions on two additional tasks: replacing images with
captions in VQA tasks, and reconstructing images from captions to assess
semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.
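
To make the two components concrete, below is a minimal Python sketch of the pipeline the abstract describes: a heuristic question-answering loop whose length is the inference budget, followed by sentence-level contrastive filtering. The `generate` and `logprob` hooks, the prompts, and the acceptance `margin` are illustrative assumptions, not the interface of the released code; see the repository for the actual implementation.

```python
from typing import Callable, List, Optional

# Hypothetical hooks standing in for an LVLM; these names are ours, not
# the API of the released ScaleCap code.
#   generate(image, prompt) -> str: text generation conditioned on the image
#   logprob(sentence, image, context) -> float: log-probability of `sentence`
#       given `context`, with the image (or None for text-only scoring)
Generate = Callable[[bytes, str], str]
LogProb = Callable[[str, Optional[bytes], str], float]


def scalecap(image: bytes,
             generate: Generate,
             logprob: LogProb,
             num_questions: int = 8,
             margin: float = 0.0) -> str:
    """Produce a debiased caption; `num_questions` is the inference budget."""
    # Initial caption from the LVLM.
    caption = generate(image, "Describe this image in detail.")

    # Heuristic question answering: ask content-specific questions about
    # details the caption misses and fold the answers back in. Raising
    # `num_questions` scales inference cost and visual coverage.
    for _ in range(num_questions):
        question = generate(
            image,
            f"Current caption:\n{caption}\n"
            "Ask one question about a visual detail the caption misses.")
        answer = generate(image, question)
        caption = generate(
            image,
            f"Caption:\n{caption}\nQ: {question}\nA: {answer}\n"
            "Rewrite the caption to incorporate this information.")

    # Contrastive sentence rating: keep a sentence only if conditioning on
    # the image raises its likelihood over a text-only estimate. Sentences
    # the language prior would emit anyway are treated as hallucinations.
    kept: List[str] = []
    for sentence in _split_sentences(caption):
        context = " ".join(kept)
        gain = logprob(sentence, image, context) - logprob(sentence, None, context)
        if gain > margin:
            kept.append(sentence)
    return " ".join(kept)


def _split_sentences(text: str) -> List[str]:
    # Naive period-based splitter for the sketch; a real pipeline would
    # use a proper sentence segmenter.
    return [s.strip() + "." for s in text.split(".") if s.strip()]
```

Because the sentence scoring is offline (applied to a finished caption rather than interleaved with token-by-token decoding), the filter can condition each sentence on the already-accepted prefix, which is what the sketch's `context` argument models.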