MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Abstract

For Large Vision-Language Models (LVLMs), scaling the model can effectively improve performance. However, expanding the parameter count significantly increases training and inference costs, because all model parameters are activated for every token in the computation. In this work, we propose MoE-tuning, a novel training strategy for LVLMs that constructs a sparse model with an outrageous number of parameters but a constant computational cost, and that effectively addresses the performance degradation typically associated with multi-modal learning and model sparsity. Furthermore, we present the MoE-LLaVA framework, an MoE-based sparse LVLM architecture. During deployment, this framework activates only the top-k experts through routers and keeps the remaining experts inactive. Our extensive experiments highlight the excellent capabilities of MoE-LLaVA in visual understanding and its potential to reduce hallucinations in model outputs. Remarkably, with just 3 billion sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to LLaVA-1.5-7B on various visual understanding datasets and even surpasses LLaVA-1.5-13B on object hallucination benchmarks. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.
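The top-k expert activation described in the abstract can be illustrated with a small sketch. This is not the MoE-LLaVA implementation: the linear router, the toy scaling experts, the renormalization of the selected weights, and every name below (`top_k_route`, `moe_layer`, `make_scaling_expert`) are assumptions chosen for illustration. It only shows why per-token compute stays roughly constant as experts are added: each token runs through k experts regardless of how many exist.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: the weights of the selected experts are renormalized to sum
    to 1. Some MoE designs skip this step.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, router_weights, experts, k):
    """Apply a sparse MoE layer to one token vector.

    router_weights: one weight vector per expert (a hypothetical linear router).
    experts: list of callables mapping a vector to a vector of the same length.
    Only the k selected experts are evaluated; the rest stay inactive.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    routes = top_k_route(logits, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, [idx for idx, _ in routes]


def make_scaling_expert(scale):
    """Toy stand-in for an expert FFN: multiplies the input by a constant."""
    return lambda vec: [scale * v for v in vec]


experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [2.0, 1.0]

output, active = moe_layer(token, router_weights, experts, k=2)
print("active experts:", active)
print("output:", output)
```

With four experts and k=2, only two expert functions are called for the token. Adding more experts increases the total parameter count but not the number of expert evaluations per token.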
