ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
February 9, 2024
Authors: Siming Yan, Min Bai, Weifeng Chen, Xiong Zhou, Qixing Huang, Li Erran Li
cs.AI
Abstract
By combining the natural language understanding, generation capabilities, and broad knowledge of large language models with image perception, recent large vision language models (LVLMs) have demonstrated unprecedented reasoning capabilities in the real world. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucinating nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes of and relationships between objects. To address these issues, we introduce ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), a novel framework that uses fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is achieved efficiently, using much cheaper human evaluations and automated methods rather than full supervision. We demonstrate the effectiveness of our approach through numerous metrics on several benchmarks. Additionally, we construct a comprehensive and challenging dataset specifically designed to validate the visual grounding capabilities of LVLMs. Finally, we plan to release our human annotations, comprising approximately 16,000 image and generated-text pairs with fine-grained evaluations, to contribute to related research in the community.
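To make the idea of fine-grained reward modeling concrete, the sketch below shows one plausible way a per-sentence reward signal (for example, +1 for a well-grounded sentence and -1 for a hallucinated one) could be turned into a weighted fine-tuning objective for an LVLM. This is not the ViGoR implementation described in the paper; the function name reward_weighted_loss and the inputs token_log_probs, sentence_spans, and sentence_rewards are hypothetical placeholders chosen for illustration.

```python
# Illustrative sketch only, not the authors' method: one way a fine-grained
# (per-sentence) reward could weight an LVLM's log-likelihood during fine-tuning.
import torch

def reward_weighted_loss(token_log_probs: torch.Tensor,
                         sentence_spans: list[tuple[int, int]],
                         sentence_rewards: torch.Tensor) -> torch.Tensor:
    """Weight each generated sentence's log-likelihood by its reward.

    token_log_probs:  (T,) log-probabilities of the generated tokens under the LVLM.
    sentence_spans:   list of (start, end) token indices, one span per sentence.
    sentence_rewards: (S,) scalar reward per sentence, e.g. +1 grounded / -1 hallucinated.
    """
    loss = token_log_probs.new_zeros(())
    for (start, end), reward in zip(sentence_spans, sentence_rewards):
        # Positive rewards reinforce well-grounded sentences; negative rewards
        # push probability mass away from hallucinated ones.
        loss = loss - reward * token_log_probs[start:end].sum()
    return loss / len(sentence_spans)

# Toy usage with stand-in numbers in place of real model outputs.
log_probs = torch.full((12,), -2.3, requires_grad=True)  # 12 generated tokens
spans = [(0, 5), (5, 12)]                                 # two "sentences"
rewards = torch.tensor([1.0, -1.0])                       # grounded vs. hallucinated
loss = reward_weighted_loss(log_probs, spans, rewards)
loss.backward()
print(float(loss))
```

Under this reading, the fine-grained rewards act as sentence-level weights on an otherwise standard likelihood objective; the actual framework may combine such signals with its human and automated evaluations differently.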