RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
May 28, 2025
Authors: Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
cs.AI
Abstract
Image recaptioning is widely used to generate training datasets with enhanced
quality for various multimodal tasks. Existing recaptioning methods typically
rely on powerful multimodal large language models (MLLMs) to enhance textual
descriptions, but often suffer from inaccuracies due to hallucinations and
incompleteness caused by missing fine-grained details. To address these
limitations, we propose RICO, a novel framework that refines captions through
visual reconstruction. Specifically, we leverage a text-to-image model to
reconstruct a caption into a reference image, and prompt an MLLM to identify
discrepancies between the original and reconstructed images to refine the
caption. This process is performed iteratively, progressively yielding
more faithful and comprehensive descriptions. To mitigate the
additional computational cost induced by the iterative process, we introduce
RICO-Flash, which learns to generate captions like RICO using DPO. Extensive
experiments demonstrate that our approach significantly improves caption
accuracy and completeness, outperforming most baselines by approximately 10% on
both CapsBench and CompreCap. Code is released at
https://github.com/wangyuchi369/RICO.
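The refine-by-reconstruction loop described in the abstract can be sketched as follows. This is a minimal illustration, not the released implementation: `generate_image`, `find_discrepancies`, and `refine_caption` are hypothetical stand-ins for the text-to-image model and the two MLLM prompts.

```python
def rico_refine(original_image, caption, generate_image, find_discrepancies,
                refine_caption, num_iters=3):
    """Iteratively refine a caption via visual reconstruction (RICO-style sketch).

    generate_image:     caption -> reconstructed image (text-to-image model)
    find_discrepancies: (original, reconstructed) -> list of differences (MLLM)
    refine_caption:     (caption, differences) -> revised caption (MLLM)
    All three callables are hypothetical placeholders, not APIs from the paper.
    """
    for _ in range(num_iters):
        # Reconstruct a reference image from the current caption.
        reconstructed = generate_image(caption)
        # Ask the MLLM where the reconstruction diverges from the original.
        diffs = find_discrepancies(original_image, reconstructed)
        if not diffs:
            break  # caption is already faithful and complete
        # Fold the identified discrepancies back into the caption.
        caption = refine_caption(caption, diffs)
    return caption
```

Each pass spends one text-to-image call and two MLLM calls, which is the extra cost that RICO-Flash is designed to avoid at inference time.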
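RICO-Flash distills this behavior into a single forward pass by training with DPO on preference pairs. As a rough illustration only (the abstract does not specify the training details), the standard per-pair DPO objective, treating the RICO-refined caption as "chosen" and the initial caption as "rejected", is:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    The policy is rewarded for widening the log-probability margin between the
    chosen (refined) and rejected (initial) caption, measured relative to a
    frozen reference model; beta controls the strength of that preference.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(beta * margin)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

A larger margin in favor of the refined caption drives the loss toward zero, so the trained model learns to emit RICO-like captions directly, without the iterative reconstruction loop.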