ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
February 9, 2024
Authors: Siming Yan, Min Bai, Weifeng Chen, Xiong Zhou, Qixing Huang, Li Erran Li
cs.AI
Abstract
By combining the natural language understanding, generation capabilities, and broad knowledge of large language models with image perception, recent large vision language models (LVLMs) have demonstrated unprecedented reasoning capabilities in the real world. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucinating nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes of and relationships between objects. To address these issues, we introduce ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), a novel framework that uses fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is achieved efficiently, using much cheaper human evaluations and automated methods rather than full supervision. We demonstrate the effectiveness of our approach through numerous metrics on several benchmarks. Additionally, we construct a comprehensive and challenging dataset specifically designed to validate the visual grounding capabilities of LVLMs. Finally, we plan to release our human annotations, comprising approximately 16,000 image and generated-text pairs with fine-grained evaluations, to contribute to related research in the community.
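To make the idea of fine-grained reward modeling concrete, the sketch below shows one plausible way a per-sentence reward signal (for example, +1 for a well-grounded sentence and -1 for a hallucinated one) could be turned into a weighted fine-tuning objective for an LVLM. This is not the ViGoR implementation described in the paper; the function name reward_weighted_loss and the inputs token_log_probs, sentence_spans, and sentence_rewards are hypothetical placeholders chosen for illustration.

```python
# Illustrative sketch only, not the authors' method: one way a fine-grained
# (per-sentence) reward could weight an LVLM's log-likelihood during fine-tuning.
import torch

def reward_weighted_loss(token_log_probs: torch.Tensor,
                         sentence_spans: list[tuple[int, int]],
                         sentence_rewards: torch.Tensor) -> torch.Tensor:
    """Weight each generated sentence's log-likelihood by its reward.

    token_log_probs:  (T,) log-probabilities of the generated tokens under the LVLM.
    sentence_spans:   list of (start, end) token indices, one span per sentence.
    sentence_rewards: (S,) scalar reward per sentence, e.g. +1 grounded / -1 hallucinated.
    """
    loss = token_log_probs.new_zeros(())
    for (start, end), reward in zip(sentence_spans, sentence_rewards):
        # Positive rewards reinforce well-grounded sentences; negative rewards
        # push probability mass away from hallucinated ones.
        loss = loss - reward * token_log_probs[start:end].sum()
    return loss / len(sentence_spans)

# Toy usage with stand-in numbers in place of real model outputs.
log_probs = torch.full((12,), -2.3, requires_grad=True)  # 12 generated tokens
spans = [(0, 5), (5, 12)]                                 # two "sentences"
rewards = torch.tensor([1.0, -1.0])                       # grounded vs. hallucinated
loss = reward_weighted_loss(log_probs, spans, rewards)
loss.backward()
print(float(loss))
```

Under this reading, the fine-grained rewards act as sentence-level weights on an otherwise standard likelihood objective; the actual framework may combine such signals with its human and automated evaluations differently.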