
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

June 25, 2024
Authors: Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang
cs.AI

Abstract

Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, most of these models can only process low-resolution images, which limits their effectiveness in perception tasks that require detailed visual information. In our study, we present MG-LLaVA, an MLLM that enhances visual processing by incorporating a multi-granularity vision flow comprising low-resolution, high-resolution, and object-centric features. We propose integrating an additional high-resolution visual encoder to capture fine-grained details, which are then fused with the base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B parameters, to evaluate the model's performance comprehensively. Extensive evaluations across multiple benchmarks show that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes. The code will be available at https://github.com/PhoenixZ810/MG-LLaVA.
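
The abstract does not spell out how the Conv-Gate fusion network combines the two feature streams, so the following is only a minimal sketch of the general idea of gated fusion, not the authors' implementation. It assumes the high-resolution features have already been aligned to the same shape as the low-resolution ones, and it uses a single scalar gate per token. The names `gated_fuse`, `gate_weights`, and `gate_bias` are hypothetical and chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fuse(low, high, gate_weights, gate_bias=0.0):
    """Fuse one low-resolution feature vector with an aligned high-resolution one.

    Assumption (not from the paper): the gate is one scalar in [0, 1],
    computed from the concatenated features, and the gated high-resolution
    vector is added to the low-resolution vector as a residual.
    """
    if len(low) != len(high):
        raise ValueError("low and high must have the same length")
    concat = list(low) + list(high)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    # Scalar gate from a linear projection of the concatenated features.
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    # Residual update: base features plus the gated fine-grained features.
    return [l + gate * h for l, h in zip(low, high)]


if __name__ == "__main__":
    low = [1.0, 2.0]
    high = [2.0, 4.0]
    print(gated_fuse(low, high, gate_weights=[0.0, 0.0, 0.0, 0.0]))
```

With all-zero gate weights and zero bias the gate is 0.5, so half of each high-resolution value is added to the base features. A strongly negative bias closes the gate and leaves the low-resolution features unchanged, which is the property that lets a gated design fall back to the base encoder when fine-grained detail is not useful.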
