MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models

June 17, 2025
作者: Hongyu Wang, Jiayu Xu, Ruiping Wang, Yan Feng, Yitao Zhai, Peng Pei, Xunliang Cai, Xilin Chen
cs.AI

Abstract

Large multimodal Mixture-of-Experts (MoE) models effectively scale model size to boost performance while keeping the number of active parameters fixed. However, previous works primarily used full-precision experts during sparse up-cycling. Although these models show superior performance on end tasks, the large number of experts introduces a higher memory footprint, which poses significant challenges for deployment on edge devices. In this work, we propose MoTE, a scalable and memory-efficient approach for training Mixture-of-Ternary-Experts models from a dense checkpoint. Instead of training fewer high-precision experts, we propose training more low-precision experts during up-cycling. Specifically, we use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Extensive experiments show that our approach exhibits a promising scaling trend with model size. MoTE achieves performance comparable to the full-precision baseline MoE-LLaVA while offering a lower memory footprint. Furthermore, our approach is compatible with post-training quantization methods, and its advantage is amplified further as the memory constraint tightens. Given the same expert memory footprint of 3.4 GB and combined with post-training quantization, MoTE outperforms MoE-LLaVA by 4.3% average accuracy on end tasks, demonstrating its effectiveness and potential for memory-constrained devices.
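The architecture sketched in the abstract, a full-precision shared expert (the pre-trained FFN) plus routed experts whose weights are constrained to {-1, 0, 1}, can be rendered as a short PyTorch sketch. The absmean ternarization rule (as in BitNet b1.58), the straight-through estimator, the top-k softmax router, and all names (`ternarize`, `TernaryExpert`, `MoTELayer`) are illustrative assumptions based only on the abstract, not the authors' exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def ternarize(w: torch.Tensor) -> torch.Tensor:
    """Round weights to {-1, 0, 1} scaled by their mean absolute value.

    Absmean scaling (as in BitNet b1.58) is an assumption; the paper may use a
    different rule. A straight-through estimator lets gradients reach the
    latent full-precision weights during training.
    """
    scale = w.abs().mean().clamp(min=1e-5)
    w_t = (w / scale).round().clamp(-1, 1) * scale
    return w + (w_t - w).detach()  # straight-through estimator


class TernaryLinear(nn.Linear):
    """nn.Linear whose effective weights are ternarized on every forward."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, ternarize(self.weight), self.bias)


class TernaryExpert(nn.Module):
    """A feed-forward expert built from ternary projections."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = TernaryLinear(d_model, d_ff)
        self.down = TernaryLinear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))


class MoTELayer(nn.Module):
    """Full-precision shared FFN plus top-k routed ternary experts."""

    def __init__(self, shared_ffn: nn.Module, d_model: int, d_ff: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.shared = shared_ffn  # initialized from the dense checkpoint
        self.experts = nn.ModuleList(
            [TernaryExpert(d_model, d_ff) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); every token goes through the shared expert.
        out = self.shared(x)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        # Add the weighted outputs of each token's top-k ternary experts.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out


# Hypothetical usage: up-cycle a dense FFN into a MoTE layer.
dense_ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
layer = MoTELayer(dense_ffn, d_model=512, d_ff=2048)
y = layer(torch.randn(4, 512))  # (4, 512)
```

After training, only the ternary values and per-tensor scales of the routed experts would need to be stored (roughly 1.6 bits per weight), which is presumably where the memory saving over full-precision experts comes from; the shared expert stays at full precision.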