LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
September 4, 2024
Authors: Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang
cs.AI
Abstract
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction, and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model LongLLaVA (Long-Context Large Language and Vision Assistant) is the first hybrid MLLM, achieving a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
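To make the hybrid Mamba-Transformer idea concrete, below is a minimal, self-contained PyTorch sketch of a backbone that interleaves state-space-style blocks with standard Transformer layers. The `MambaBlockStub`, the layer ratio, and all hyperparameters are illustrative assumptions for exposition only, not the paper's actual architecture or configuration; a real system would use a proper selective state-space implementation in place of the stub.

```python
import torch
import torch.nn as nn


class MambaBlockStub(nn.Module):
    """Placeholder for a Mamba (selective state-space) block.

    This is only a rough stand-in (causal depthwise conv + gating);
    a real implementation would replace it with an actual SSM layer.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Causal depthwise conv: pad by kernel_size - 1, trim the tail.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=4,
                              padding=3, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.norm(x)
        h_conv = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        h = self.proj(h_conv * torch.sigmoid(self.gate(h)))
        return x + h  # residual connection


class HybridBackbone(nn.Module):
    """Interleaves SSM-style blocks with Transformer blocks.

    The ratio (one Transformer layer every `transformer_every` layers)
    is a hypothetical choice for illustration.
    """

    def __init__(self, d_model: int = 1024, n_layers: int = 24,
                 transformer_every: int = 8, n_heads: int = 16):
        super().__init__()
        layers = []
        for i in range(n_layers):
            if (i + 1) % transformer_every == 0:
                layers.append(nn.TransformerEncoderLayer(
                    d_model, n_heads, dim_feedforward=4 * d_model,
                    batch_first=True, norm_first=True))
            else:
                layers.append(MambaBlockStub(d_model))
        self.layers = nn.ModuleList(layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    # e.g. a long sequence of flattened multi-image tokens
    tokens = torch.randn(1, 2048, 1024)
    print(HybridBackbone()(tokens).shape)  # torch.Size([1, 2048, 1024])
```

The design intuition this sketch illustrates: the SSM-style blocks scale linearly with sequence length, which keeps memory and throughput manageable as the number of image tokens grows, while the occasional attention layers retain global token-to-token interaction.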