MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models

Abstract

Large multimodal Mixture-of-Experts (MoE) models effectively scale model size to boost performance while keeping the number of active parameters fixed. However, previous works primarily used full-precision experts during sparse up-cycling. Although they show superior performance on end tasks, the large number of experts introduces a higher memory footprint, which poses significant challenges for deployment on edge devices. In this work, we propose MoTE, a scalable and memory-efficient approach to train Mixture-of-Ternary-Experts models from a dense checkpoint. Instead of training fewer high-precision experts, we propose to train more low-precision experts during up-cycling. Specifically, we use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Extensive experiments show that our approach exhibits a promising scaling trend with model size. MoTE achieves performance comparable to the full-precision baseline MoE-LLaVA while offering a lower memory footprint. Furthermore, our approach is compatible with post-training quantization methods, and its advantage grows as the memory budget tightens. At the same expert memory footprint of 3.4 GB, and combined with post-training quantization, MoTE outperforms MoE-LLaVA by 4.3% in average accuracy on end tasks, demonstrating its effectiveness and potential for memory-constrained devices.
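
To make the idea of a shared full-precision expert plus ternary routed experts concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify the quantizer, the routing rule, or the storage format, so three things here are assumptions: absmean scaling followed by round-and-clip for ternarization, top-1 routing, and packing each ternary weight into 2 bits. All function names (`ternarize`, `expert_memory_gb`, `moe_forward`) are hypothetical.

```python
def ternarize(weights):
    """Quantize a list of floats to codes in {-1, 0, 1} plus one float scale.

    Assumption: absmean scaling (scale = mean of |w|), then round and clip.
    The abstract does not state which quantizer MoTE actually uses.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def expert_memory_gb(num_params, bits_per_param):
    """Storage for num_params weights at a given bit width, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def moe_forward(x, shared_expert, routed_experts, router_scores):
    """Shared expert output plus the top-1 routed expert, weighted by its score.

    Assumption: top-1 routing; the abstract does not give the routing rule.
    """
    top = max(range(len(routed_experts)), key=lambda i: router_scores[i])
    return shared_expert(x) + router_scores[top] * routed_experts[top](x)


# Ternarize a toy weight vector.
codes, scale = ternarize([0.9, -0.05, -1.2, 0.4])

# Compare storage for a hypothetical 1B-parameter expert set:
# 16-bit weights versus ternary weights packed at 2 bits each (assumed packing).
fp16_gb = expert_memory_gb(1e9, 16)
ternary_gb = expert_memory_gb(1e9, 2)

# Toy forward pass with scalar "experts" standing in for FFNs.
y = moe_forward(1.0, lambda v: 2 * v, [lambda v: v, lambda v: -v], [0.3, 0.7])
```

Under the assumed 2-bit packing, the routed experts take one eighth of the storage of 16-bit experts, which illustrates why more low-precision experts can fit in the same memory budget as fewer high-precision ones.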
