RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
May 28, 2025
Authors: Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
cs.AI
Abstract
Image recaptioning is widely used to generate training datasets with enhanced
quality for various multimodal tasks. Existing recaptioning methods typically
rely on powerful multimodal large language models (MLLMs) to enhance textual
descriptions, but often suffer from inaccuracies due to hallucinations and
incompleteness caused by missing fine-grained details. To address these
limitations, we propose RICO, a novel framework that refines captions through
visual reconstruction. Specifically, we leverage a text-to-image model to
reconstruct a caption into a reference image, and prompt an MLLM to identify
discrepancies between the original and reconstructed images to refine the
caption. This process is performed iteratively, progressively yielding
more faithful and comprehensive descriptions. To mitigate the
additional computational cost induced by the iterative process, we introduce
RICO-Flash, which learns to generate captions like RICO using DPO. Extensive
experiments demonstrate that our approach significantly improves caption
accuracy and completeness, outperforming most baselines by approximately 10% on
both CapsBench and CompreCap. Code is released at
https://github.com/wangyuchi369/RICO.
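The refine-by-reconstruction loop described in the abstract can be sketched as follows. This is a minimal illustration, not the released implementation: `generate_image`, `find_discrepancies`, and `refine_caption` are hypothetical stand-ins for the text-to-image model and the two MLLM prompts.

```python
def rico_refine(original_image, caption, generate_image, find_discrepancies,
                refine_caption, num_iters=3):
    """Iteratively refine a caption via visual reconstruction (RICO-style sketch).

    generate_image:     caption -> reconstructed image (text-to-image model)
    find_discrepancies: (original, reconstructed) -> list of differences (MLLM)
    refine_caption:     (caption, differences) -> revised caption (MLLM)
    All three callables are hypothetical placeholders, not APIs from the paper.
    """
    for _ in range(num_iters):
        # Reconstruct a reference image from the current caption.
        reconstructed = generate_image(caption)
        # Ask the MLLM where the reconstruction diverges from the original.
        diffs = find_discrepancies(original_image, reconstructed)
        if not diffs:
            break  # caption is already faithful and complete
        # Fold the identified discrepancies back into the caption.
        caption = refine_caption(caption, diffs)
    return caption
```

Each pass spends one text-to-image call and two MLLM calls, which is the extra cost that RICO-Flash is designed to avoid at inference time.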
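RICO-Flash distills this behavior into a single forward pass by training with DPO on preference pairs. As a rough illustration only (the abstract does not specify the training details), the standard per-pair DPO objective, treating the RICO-refined caption as "chosen" and the initial caption as "rejected", is:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    The policy is rewarded for widening the log-probability margin between the
    chosen (refined) and rejected (initial) caption, measured relative to a
    frozen reference model; beta controls the strength of that preference.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(beta * margin)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

A larger margin in favor of the refined caption drives the loss toward zero, so the trained model learns to emit RICO-like captions directly, without the iterative reconstruction loop.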