LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Abstract

Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This requires systematic optimization of the model architecture, data construction, and training strategy, particularly to address challenges such as degraded performance as the number of images grows and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, construct data that captures both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model, LongLLaVA (Long-Context Large Language and Vision Assistant), is the first hybrid MLLM and achieves a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
