MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
June 25, 2024
Authors: Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang
cs.AI
Abstract
Multi-modal large language models (MLLMs) have made significant strides in
various visual understanding tasks. However, the majority of these models are
constrained to process low-resolution images, which limits their effectiveness
in perception tasks that necessitate detailed visual information. In our study,
we present MG-LLaVA, an innovative MLLM that enhances the model's visual
processing capabilities by incorporating a multi-granularity vision flow, which
includes low-resolution, high-resolution, and object-centric features. We
propose the integration of an additional high-resolution visual encoder to
capture fine-grained details, which are then fused with base visual features
through a Conv-Gate fusion network. To further refine the model's object
recognition abilities, we incorporate object-level features derived from
bounding boxes identified by offline detectors. Being trained solely on
publicly available multimodal data through instruction tuning, MG-LLaVA
demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide
variety of language encoders, ranging from 3.8B to 34B, to evaluate the model's
performance comprehensively. Extensive evaluations across multiple benchmarks
demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter
sizes, showcasing its remarkable efficacy. The code will be available at
https://github.com/PhoenixZ810/MG-LLaVA.
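
The abstract says high-resolution features are fused with the base (low-resolution) features through a Conv-Gate fusion network, but gives no further detail. As a rough illustration only, the sketch below shows a generic gated fusion for a single pair of aligned feature tokens: a learned gate, computed from both inputs, decides how much of the high-resolution signal is added to the base feature. The function names, the gate formula, and the toy weights are all assumptions made for this example and should not be read as the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(vec, weight, bias):
    """Per-token linear map; at a single spatial position this is what a
    1x1 convolution computes. `weight` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def gated_fuse(low, high, gate_weight, gate_bias):
    """Hypothetical gated fusion of one low-res token with its aligned
    high-res token (an assumption, not the paper's exact formula):

        gate  = sigmoid(W @ concat(low, high) + b)
        fused = low + gate * high
    """
    gate = [sigmoid(g) for g in linear(low + high, gate_weight, gate_bias)]
    return [l + g * h for l, g, h in zip(low, gate, high)]


# Toy example with 2-channel features. Zero gate weights and biases make
# the gate 0.5 for every channel, so half of `high` is added to `low`.
low = [1.0, 2.0]
high = [4.0, -2.0]
gate_weight = [[0.0] * 4 for _ in range(2)]
gate_bias = [0.0, 0.0]

fused = gated_fuse(low, high, gate_weight, gate_bias)
print(fused)  # [3.0, 1.0]
```

In a real model the gate weights would be learned, and the two feature maps would first have to be brought to the same spatial size and channel count. How MG-LLaVA does this is not stated in the abstract.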