LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
September 4, 2024
Authors: Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang
cs.AI
Abstract
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction, and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model LongLLaVA (Long-Context Large Language and Vision Assistant) is the first hybrid MLLM, achieving a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
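To make the hybrid Mamba-Transformer idea concrete, below is a minimal, self-contained PyTorch sketch of a backbone that interleaves state-space-style blocks with standard Transformer layers. The `MambaBlockStub`, the layer ratio, and all hyperparameters are illustrative assumptions for exposition only, not the paper's actual architecture or configuration; a real system would use a proper selective state-space implementation in place of the stub.

```python
import torch
import torch.nn as nn


class MambaBlockStub(nn.Module):
    """Placeholder for a Mamba (selective state-space) block.

    This is only a rough stand-in (causal depthwise conv + gating);
    a real implementation would replace it with an actual SSM layer.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Causal depthwise conv: pad by kernel_size - 1, trim the tail.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=4,
                              padding=3, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.norm(x)
        h_conv = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        h = self.proj(h_conv * torch.sigmoid(self.gate(h)))
        return x + h  # residual connection


class HybridBackbone(nn.Module):
    """Interleaves SSM-style blocks with Transformer blocks.

    The ratio (one Transformer layer every `transformer_every` layers)
    is a hypothetical choice for illustration.
    """

    def __init__(self, d_model: int = 1024, n_layers: int = 24,
                 transformer_every: int = 8, n_heads: int = 16):
        super().__init__()
        layers = []
        for i in range(n_layers):
            if (i + 1) % transformer_every == 0:
                layers.append(nn.TransformerEncoderLayer(
                    d_model, n_heads, dim_feedforward=4 * d_model,
                    batch_first=True, norm_first=True))
            else:
                layers.append(MambaBlockStub(d_model))
        self.layers = nn.ModuleList(layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    # e.g. a long sequence of flattened multi-image tokens
    tokens = torch.randn(1, 2048, 1024)
    print(HybridBackbone()(tokens).shape)  # torch.Size([1, 2048, 1024])
```

The design intuition this sketch illustrates: the SSM-style blocks scale linearly with sequence length, which keeps memory and throughput manageable as the number of image tokens grows, while the occasional attention layers retain global token-to-token interaction.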