Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

April 30, 2024
作者: Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui
cs.AI

Abstract

Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible, training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) uses tools such as object detection and VQA models to fact-check the proposed captions; 3) captioning, where an LLM generates the final caption by summarizing the caption proposals and the fact-checking results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the similarity between the original image and the image reconstructed from the caption by a text-to-image model; 3) a human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-source captioning methods on 2D images from the COCO dataset and on 3D assets from the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
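
The three-step pipeline lends itself to a compact orchestration loop. Below is a minimal sketch, in Python, of how proposal, verification, and captioning could be wired together. Every callable passed in (propose_captions, llm, detect_objects, vqa) is a hypothetical placeholder standing in for an underlying open-source model; this is not the paper's actual interface.

# Minimal sketch of the three-step VFC pipeline described in the abstract.
# All injected callables are hypothetical placeholders, not the paper's API.
from dataclasses import dataclass

@dataclass
class FactCheck:
    claim: str
    verdict: str  # "supported" or "not found"

def visual_fact_checker(image, instruction, propose_captions, llm,
                        detect_objects, vqa):
    # Step 1: proposal. Image-to-text captioning models draft several
    # initial candidate captions.
    proposals = propose_captions(image, n=3)

    # Step 2: verification. An LLM extracts checkable claims from the
    # proposals (assumed here to return a list of strings), and each claim
    # is checked against the image with detection and VQA tools.
    claims = llm("List the checkable claims in these captions: "
                 + " | ".join(proposals))
    checks = []
    for claim in claims:
        found = detect_objects(image, query=claim)
        answer = vqa(image, question="Is it true that " + claim + "?")
        verdict = "supported" if found and answer == "yes" else "not found"
        checks.append(FactCheck(claim, verdict))

    # Step 3: captioning. An LLM summarizes the proposals and the
    # fact-check results into one final caption in the requested style.
    summary = "; ".join(c.claim + " -> " + c.verdict for c in checks)
    return llm("Write one caption keeping only supported content. "
               "Style instruction: " + instruction
               + " Proposals: " + " | ".join(proposals)
               + " Fact checks: " + summary)

Because the models are injected as plain callables, any captioner, detector, or VQA model can be swapped in, which matches the abstract's claim that the pipeline is training-free and built from interchangeable open-source components.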
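
The two automatic metrics can likewise be sketched with an off-the-shelf CLIP model. The snippet below uses the Hugging Face transformers CLIP API; computing both scores as cosine similarity between embeddings is an assumption about the exact formulation, and the reconstructed image is presumed to come from a separate text-to-image model run on the caption.

# Sketch of CLIP-Score (image-text) and CLIP-Image-Score (image-image),
# assuming cosine similarity of CLIP embeddings. Inputs are PIL images.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, caption):
    # Cosine similarity between CLIP embeddings of an image and a caption.
    inputs = processor(text=[caption], images=[image],
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def clip_image_score(original, reconstructed):
    # Cosine similarity between CLIP embeddings of the original image and
    # an image regenerated from the caption by a text-to-image model.
    inputs = processor(images=[original, reconstructed], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()

The intuition behind CLIP-Image-Score is that a detailed, hallucination-free caption should let a text-to-image model reconstruct something close to the original, so image-image similarity serves as a round-trip check on caption fidelity.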
