LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Abstract

Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This requires a series of systematic optimizations spanning model architecture, data construction, and training strategy, particularly to address challenges such as degraded performance as the number of images grows and high computational cost. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model, LongLLaVA (Long-Context Large Language and Vision Assistant), is the first hybrid MLLM and achieves a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
