Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
October 1, 2025
作者: Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, Linfeng Zhang
cs.AI
Abstract
Visual tokens consume substantial computational resources in multi-modal
large language models (MLLMs), significantly compromising their efficiency. Recent works
have attempted to improve efficiency by compressing visual tokens during
training, either through modifications to model components or by introducing
additional parameters. However, they often overlook the increased learning
difficulty caused by such compression, as the model's parameter space struggles
to quickly adapt to the substantial perturbations in the feature space induced
by token compression. In this work, we propose to develop Efficient MLLMs via
Progressive Consistency Distillation (EPIC), a progressive learning framework.
Specifically, by decomposing the feature space perturbations introduced by
token compression along the token-wise and layer-wise dimensions, we introduce
token consistency distillation and layer consistency distillation,
respectively, aiming to reduce the training difficulty by leveraging guidance
from a teacher model and following a progressive learning trajectory. Extensive
experiments demonstrate the superior effectiveness, robustness, and
generalization capabilities of our proposed framework.
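To make the idea concrete, below is a minimal, hypothetical PyTorch-style sketch of the ingredients the abstract describes: a teacher forward pass over the full visual token sequence, a student forward pass over a progressively compressed subset, and a consistency (distillation) loss between the two. The random token-dropping policy, the linear schedule, and all function and argument names are illustrative assumptions rather than the authors' implementation; layer consistency distillation would apply the same idea at intermediate-layer compression points.

```python
# Hypothetical sketch of token consistency distillation with a progressive
# compression schedule (illustrative only; not the EPIC reference code).
import torch
import torch.nn.functional as F


def progressive_keep_ratio(step: int, total_steps: int,
                           start: float = 1.0, end: float = 0.25) -> float:
    """Anneal the fraction of retained visual tokens from `start` down to `end`."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac


def compress_tokens(visual_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep a random subset of visual tokens (a stand-in for any compression policy)."""
    _, num_tokens, _ = visual_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    idx, _ = torch.randperm(num_tokens, device=visual_tokens.device)[:num_keep].sort()
    return visual_tokens[:, idx, :]


def consistency_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between compressed-input (student) and full-input (teacher) predictions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2


def training_step(model, teacher, visual_tokens, text_inputs,
                  step, total_steps, task_loss_fn):
    """One training step: the teacher sees all visual tokens, the student a compressed subset.

    Assumes both forward passes return logits aligned on the text (response) positions,
    which are unaffected by visual-token compression.
    """
    keep_ratio = progressive_keep_ratio(step, total_steps)
    with torch.no_grad():
        teacher_logits = teacher(visual_tokens, text_inputs)
    student_logits = model(compress_tokens(visual_tokens, keep_ratio), text_inputs)
    return task_loss_fn(student_logits) + consistency_loss(student_logits, teacher_logits)
```

Under this reading, the progressive schedule keeps the feature-space perturbation small at any single step (the student starts near the uncompressed regime and is compressed more aggressively over training), while the consistency term supplies teacher guidance throughout.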