RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
May 28, 2025
Authors: Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
cs.AI
Abstract
Image recaptioning is widely used to generate training datasets with enhanced
quality for various multimodal tasks. Existing recaptioning methods typically
rely on powerful multimodal large language models (MLLMs) to enhance textual
descriptions, but often suffer from inaccuracies due to hallucinations and
incompleteness caused by missing fine-grained details. To address these
limitations, we propose RICO, a novel framework that refines captions through
visual reconstruction. Specifically, we leverage a text-to-image model to
reconstruct a caption into a reference image, and prompt an MLLM to identify
discrepancies between the original and reconstructed images to refine the
caption. This process is performed iteratively, progressively yielding more
faithful and comprehensive descriptions. To mitigate the
additional computational cost induced by the iterative process, we introduce
RICO-Flash, which learns to generate RICO-like captions using DPO. Extensive
experiments demonstrate that our approach significantly improves caption
accuracy and completeness, outperforming most baselines by approximately 10%
on both CapsBench and CompreCap. Code is released at
https://github.com/wangyuchi369/RICO.
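The iterative refinement loop described in the abstract can be sketched as below. This is a minimal illustration, not the paper's implementation: the three model calls (text-to-image reconstruction, MLLM discrepancy detection, MLLM caption refinement) are hypothetical stubs standing in for the real models.

```python
# Sketch of the RICO-style caption refinement loop (hypothetical stubs).

def reconstruct_image(caption):
    # Stub: a real system would call a text-to-image model here.
    return f"<image rendered from: {caption}>"

def find_discrepancies(original_image, reference_image):
    # Stub: a real system would prompt an MLLM to compare the two images
    # and describe hallucinated or missing content.
    if original_image != reference_image:
        return ["missing fine-grained detail"]
    return []

def refine_caption(caption, discrepancies):
    # Stub: a real system would prompt an MLLM to rewrite the caption
    # so that it resolves the listed discrepancies.
    return caption + "; " + "; ".join(discrepancies)

def rico_refine(original_image, caption, max_iters=3):
    """Iteratively refine a caption via visual reconstruction."""
    for _ in range(max_iters):
        reference = reconstruct_image(caption)
        discrepancies = find_discrepancies(original_image, reference)
        if not discrepancies:
            break  # reconstruction matches: caption is faithful enough
        caption = refine_caption(caption, discrepancies)
    return caption

refined = rico_refine("<photo of a dog on a beach>", "a dog")
print(refined)
```

With real models, the loop would terminate once the MLLM no longer finds discrepancies between the original and reconstructed images; RICO-Flash distills this multi-step behavior into a single forward pass via DPO.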