
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

April 30, 2024
Authors: Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui
cs.AI

Abstract

Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check the proposed captions; 3) captioning, where an LLM generates the final caption by summarizing the caption proposals and the fact-checking results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original image and the reconstructed image generated by a text-to-image model from the caption; 3) a human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-source captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
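
To make the three-step pipeline concrete, below is a minimal Python sketch of the proposal, verification, and captioning loop described in the abstract. The callable names (extract_claims, verify_claim, summarize) are illustrative placeholders, not the paper's actual interfaces; in the paper the tools are concrete models such as an object detector and a VQA model, orchestrated by an LLM.

```python
from typing import Callable, List, Tuple

def visual_fact_checker(
    image,
    instruction: str,
    captioners: List[Callable],   # image -> proposed caption (str)
    extract_claims: Callable,     # caption -> list of atomic claim strings
    verify_claim: Callable,       # (image, claim) -> bool, via detection/VQA tools
    summarize: Callable,          # (proposals, checks, instruction) -> final caption
) -> str:
    """Training-free captioning with fact checking (sketch, not the paper's code)."""
    # Step 1: proposal -- each image-to-text model proposes an initial caption.
    proposals = [captioner(image) for captioner in captioners]

    # Step 2: verification -- decompose every proposal into atomic claims and
    # fact-check each claim with tools such as an object detector or VQA model.
    checks: List[Tuple[str, bool]] = []
    for caption in proposals:
        for claim in extract_claims(caption):
            checks.append((claim, verify_claim(image, claim)))

    # Step 3: captioning -- an LLM summarizes the proposals and the check
    # results into one final caption in the style requested by the instruction.
    return summarize(proposals, checks, instruction)
```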
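The two CLIP-based metrics can likewise be sketched with Hugging Face's transformers CLIP implementation. The checkpoint choice and the raw cosine similarity (with no rescaling) are assumptions; the paper may use a different CLIP variant or normalization. The reconstructed argument is assumed to be the image regenerated from the caption by a text-to-image model.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Checkpoint is an assumption; the paper may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, caption: str) -> float:
    """CLIP-Score: cosine similarity between image and caption embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def clip_image_score(original, reconstructed) -> float:
    """CLIP-Image-Score: cosine similarity between the original image and an
    image regenerated from the caption by a text-to-image model."""
    inputs = processor(images=[original, reconstructed], return_tensors="pt")
    with torch.no_grad():
        embeds = model.get_image_features(**inputs)
    embeds = embeds / embeds.norm(dim=-1, keepdim=True)
    return (embeds[0] @ embeds[1]).item()
```

Intuitively, a caption that hallucinates content will still reconstruct a plausible image, but one that diverges from the original, which is what CLIP-Image-Score is designed to penalize.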

