DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
July 11, 2024
Authors: Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, Ling-Yu Duan
cs.AI
Abstract
Existing Multimodal Large Language Models (MLLMs) increasingly emphasize
complex understanding of various visual elements, including multiple objects,
text information, and spatial relations. Their development for comprehensive
visual perception hinges on the availability of high-quality image-text
datasets that offer diverse visual elements and thorough image descriptions.
However, the scarcity of such hyper-detailed datasets currently hinders
progress within the MLLM community. The bottleneck stems from the limited
perceptual capabilities of current caption engines, which fall short in
providing complete and accurate annotations. To facilitate cutting-edge
research on comprehensive visual perception in MLLMs, we propose
Perceptual Fusion, which uses a low-budget but highly effective caption engine
to produce complete and accurate image descriptions. Specifically, Perceptual Fusion
integrates diverse perception experts as image priors to provide explicit
information on visual elements and adopts an efficient MLLM as a central pivot
to mimic advanced MLLMs' perception abilities. We carefully select 1M highly
representative images from the uncurated LAION dataset and generate dense
descriptions with our engine; we call the resulting dataset DenseFusion-1M.
Extensive experiments validate that our engine outperforms its counterparts
and that the resulting
dataset significantly improves the perception and cognition abilities of
existing MLLMs across diverse vision-language benchmarks, especially with
high-resolution images as inputs. The dataset and code are publicly available
at https://github.com/baaivision/DenseFusion.