LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Abstract

Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction, and training strategy, and in particular requires addressing challenges such as performance degradation as the number of images grows and high computational cost. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images in mind, and employ a progressive training strategy. The released model, LongLLaVA (Long-Context Large Language and Vision Assistant), is the first hybrid MLLM and achieves a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
