RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
May 28, 2025
Authors: Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
cs.AI
Abstract
Image recaptioning is widely used to generate training datasets with enhanced
quality for various multimodal tasks. Existing recaptioning methods typically
rely on powerful multimodal large language models (MLLMs) to enhance textual
descriptions, but often suffer from inaccuracies due to hallucinations and
incompleteness caused by missing fine-grained details. To address these
limitations, we propose RICO, a novel framework that refines captions through
visual reconstruction. Specifically, we leverage a text-to-image model to
reconstruct a caption into a reference image, and prompt an MLLM to identify
discrepancies between the original and reconstructed images to refine the
caption. This process is performed iteratively, progressively yielding more
faithful and comprehensive descriptions. To mitigate the
additional computational cost induced by the iterative process, we introduce
RICO-Flash, which learns to generate RICO-like captions using DPO. Extensive
experiments demonstrate that our approach significantly improves caption
accuracy and completeness, outperforming most baselines by approximately 10%
on both CapsBench and CompreCap. Code is released at
https://github.com/wangyuchi369/RICO.
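The iterative refinement loop described in the abstract can be sketched as below. This is a minimal illustration, not the paper's implementation: the three model calls (text-to-image reconstruction, MLLM discrepancy detection, MLLM caption refinement) are hypothetical stubs standing in for the real models.

```python
# Sketch of the RICO-style caption refinement loop (hypothetical stubs).

def reconstruct_image(caption):
    # Stub: a real system would call a text-to-image model here.
    return f"<image rendered from: {caption}>"

def find_discrepancies(original_image, reference_image):
    # Stub: a real system would prompt an MLLM to compare the two images
    # and describe hallucinated or missing content.
    if original_image != reference_image:
        return ["missing fine-grained detail"]
    return []

def refine_caption(caption, discrepancies):
    # Stub: a real system would prompt an MLLM to rewrite the caption
    # so that it resolves the listed discrepancies.
    return caption + "; " + "; ".join(discrepancies)

def rico_refine(original_image, caption, max_iters=3):
    """Iteratively refine a caption via visual reconstruction."""
    for _ in range(max_iters):
        reference = reconstruct_image(caption)
        discrepancies = find_discrepancies(original_image, reference)
        if not discrepancies:
            break  # reconstruction matches: caption is faithful enough
        caption = refine_caption(caption, discrepancies)
    return caption

refined = rico_refine("<photo of a dog on a beach>", "a dog")
print(refined)
```

With real models, the loop would terminate once the MLLM no longer finds discrepancies between the original and reconstructed images; RICO-Flash distills this multi-step behavior into a single forward pass via DPO.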