LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
September 4, 2024
Authors: Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang
cs.AI
Abstract
Expanding the long-context capabilities of Multi-modal Large Language
Models (MLLMs) is crucial for video understanding, high-resolution image
understanding, and multi-modal agents. This involves a series of systematic
optimizations, including model architecture, data construction, and training
strategy, particularly to address challenges such as performance that degrades
as the number of images grows and high computational costs. In this
paper, we adapt the model architecture to a hybrid of Mamba and Transformer
blocks, construct data that captures both temporal and spatial dependencies
among multiple images, and employ a progressive training strategy. The released
model, LongLLaVA (Long-Context Large Language and Vision Assistant), is the
first hybrid MLLM, achieving a better balance between efficiency and
effectiveness. LongLLaVA not only achieves competitive results across various
benchmarks, but also maintains high throughput and low memory consumption.
Notably, it can process nearly a thousand images on a single A100 80GB
GPU, showing promising application prospects for a wide range of tasks.