DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Abstract

Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations. Their progress toward comprehensive visual perception hinges on the availability of high-quality image-text datasets that offer diverse visual elements and thorough image descriptions. However, the scarcity of such hyper-detailed datasets currently hinders progress within the MLLM community. The bottleneck stems from the limited perceptual capabilities of current caption engines, which fall short of providing complete and accurate annotations. To facilitate cutting-edge research on comprehensive visual perception in MLLMs, we propose Perceptual Fusion, which uses a low-budget but highly effective caption engine to produce complete and accurate image descriptions. Specifically, Perceptual Fusion integrates diverse perception experts as image priors to provide explicit information on visual elements, and adopts an efficient MLLM as a central pivot to mimic the perception abilities of advanced MLLMs. We carefully select 1M highly representative images from the uncurated LAION dataset and generate dense descriptions using our engine; the resulting dataset is dubbed DenseFusion-1M. Extensive experiments validate that our engine outperforms its counterparts and that the resulting dataset significantly improves the perception and cognition abilities of existing MLLMs across diverse vision-language benchmarks, especially with high-resolution images as inputs. The dataset and code are publicly available at https://github.com/baaivision/DenseFusion.
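
As a rough illustration of the fusion idea described in the abstract, the sketch below serializes the outputs of several perception experts into text hints that could accompany an image when it is passed to a captioning MLLM. This is a minimal sketch under stated assumptions, not the authors' implementation: `ExpertPriors`, `build_fusion_prompt`, the choice of experts (detection, OCR, tagging), and the prompt wording are all hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels; an assumed convention


@dataclass
class ExpertPriors:
    """Hypothetical container for the outputs of the perception experts."""
    objects: List[Tuple[str, Box]] = field(default_factory=list)  # (label, box)
    ocr_text: List[str] = field(default_factory=list)             # recognized strings
    tags: List[str] = field(default_factory=list)                 # image-level tags


def build_fusion_prompt(priors: ExpertPriors) -> str:
    """Serialize expert outputs into text hints for a captioning model."""
    lines = ["Describe the image in detail. Use the hints below where they help."]
    if priors.tags:
        lines.append("Tags: " + ", ".join(priors.tags))
    for label, (x1, y1, x2, y2) in priors.objects:
        lines.append(f"Object: {label} at [{x1}, {y1}, {x2}, {y2}]")
    for text in priors.ocr_text:
        lines.append(f'Text in image: "{text}"')
    return "\n".join(lines)


# Made-up expert outputs for a single image, for illustration only.
priors = ExpertPriors(
    objects=[("dog", (10, 20, 200, 180))],
    ocr_text=["OPEN"],
    tags=["street", "storefront"],
)
prompt = build_fusion_prompt(priors)
print(prompt)
```

In a pipeline of this shape, the resulting prompt would be sent together with the image to the caption engine. Keeping the hints as plain text means any expert can be added or dropped without changing the captioning model's interface.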

