NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Abstract

Compositional training has been the de facto paradigm in existing Multimodal Large Language Models (MLLMs): pre-trained vision encoders are connected with pre-trained large language models (LLMs) through continuous multimodal pre-training. Because the components are trained separately, however, the multimodal scaling properties of this paradigm remain difficult to explore. In this paper, we focus on native, end-to-end training of MLLMs and systematically study its design space and scaling properties under a practical setting, namely data constraints. Through a careful study of the design choices in MLLMs, we obtain a meta-architecture that best balances performance and training cost. We then explore the scaling properties of native MLLMs further and find a positively correlated scaling relationship between vision encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm that NaViL performs competitively against existing MLLMs. In addition, our findings and results provide in-depth insights for future research on native MLLMs.
