

MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models

June 17, 2025
Authors: Hongyu Wang, Jiayu Xu, Ruiping Wang, Yan Feng, Yitao Zhai, Peng Pei, Xunliang Cai, Xilin Chen
cs.AI

Abstract

Large multimodal Mixture-of-Experts (MoE) models effectively scale model size to boost performance while keeping the number of active parameters fixed. However, previous works primarily used full-precision experts during sparse up-cycling. Although these methods achieve superior performance on end tasks, the large number of experts incurs a higher memory footprint, which poses significant challenges for deployment on edge devices. In this work, we propose MoTE, a scalable and memory-efficient approach for training Mixture-of-Ternary-Experts models from a dense checkpoint. Instead of training fewer high-precision experts, we propose training more low-precision experts during up-cycling. Specifically, we use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Extensive experiments show that our approach exhibits a promising scaling trend with model size. MoTE achieves performance comparable to the full-precision baseline MoE-LLaVA while offering a lower memory footprint. Furthermore, our approach is compatible with post-training quantization methods, and its advantage is amplified further as the memory constraint tightens. Given the same expert memory footprint of 3.4 GB and combined with post-training quantization, MoTE outperforms MoE-LLaVA by 4.3% average accuracy on end tasks, demonstrating its effectiveness and potential for memory-constrained devices.
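
To make the up-cycling recipe concrete, below is a minimal PyTorch sketch of a MoTE-style block: a full-precision shared FFN (initialized from the dense checkpoint) combined with routed experts whose weights are constrained to {-1, 0, 1}. The absmean ternarization, the straight-through estimator, the top-1 routing, and the names `TernaryLinear` and `MoTELayer` are illustrative assumptions, not the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def ternarize(w: torch.Tensor):
    """Absmean ternarization (assumed, BitNet b1.58 style):
    map weights to scale * {-1, 0, 1}."""
    scale = w.abs().mean().clamp(min=1e-5)
    w_t = (w / scale).round().clamp(-1, 1)
    return w_t, scale

class TernaryLinear(nn.Module):
    """Linear layer that quantizes its latent full-precision weights
    to {-1, 0, 1} on the fly (hypothetical expert building block)."""
    def __init__(self, in_f: int, out_f: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_f, in_f))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))

    def forward(self, x):
        w_t, scale = ternarize(self.weight)
        # Straight-through estimator: the forward pass uses ternary
        # weights; gradients flow into the latent full-precision weights.
        w = self.weight + (w_t * scale - self.weight).detach()
        return F.linear(x, w)

class MoTELayer(nn.Module):
    """Hypothetical MoTE block: a full-precision shared FFN plus
    top-1 routing over ternary experts."""
    def __init__(self, dim: int, hidden: int, num_experts: int):
        super().__init__()
        # Shared expert: would be initialized from the dense checkpoint's FFN.
        self.shared = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.experts = nn.ModuleList(
            nn.Sequential(TernaryLinear(dim, hidden), nn.GELU(),
                          TernaryLinear(hidden, dim))
            for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):  # x: (tokens, dim)
        probs = self.router(x).softmax(dim=-1)
        top_p, top_i = probs.max(dim=-1)  # top-1 routing
        # Naive per-token dispatch, kept simple for clarity.
        routed = torch.stack(
            [self.experts[int(i)](t) for t, i in zip(x, top_i)])
        return self.shared(x) + top_p.unsqueeze(-1) * routed

# usage: y = MoTELayer(dim=512, hidden=2048, num_experts=4)(torch.randn(8, 512))
```

Since each routed expert stores roughly 1.58 bits per parameter instead of 16, this layout is what lets MoTE trade a few high-precision experts for many low-precision ones at the same memory budget.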