
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

October 1, 2025
作者: Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, Linfeng Zhang
cs.AI

Abstract

Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.
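
To make the idea of progressive consistency distillation more concrete, below is a minimal PyTorch sketch of one plausible instantiation: a student that sees a shrinking fraction of visual tokens is distilled toward a teacher that sees all of them, with the compression ratio annealed over training. All names here (TinyMLLM, keep_ratio, the truncation-based token compression, and the schedule) are illustrative assumptions, not the paper's actual architecture or code.

```python
# A minimal sketch of token consistency distillation with a progressive
# compression schedule. TinyMLLM is a toy stand-in for an MLLM; the
# keep-first-n token compression and the linear schedule are assumptions.
import torch
import torch.nn.functional as F

class TinyMLLM(torch.nn.Module):
    """Toy model: encodes visual tokens and predicts pooled logits."""
    def __init__(self, dim=64, vocab=100):
        super().__init__()
        layer = torch.nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, visual_tokens, keep_ratio=1.0):
        # Token compression: keep the first `keep_ratio` fraction of tokens
        # (real methods typically use importance-based selection).
        n_keep = max(1, int(visual_tokens.size(1) * keep_ratio))
        x = self.encoder(visual_tokens[:, :n_keep])
        return self.head(x.mean(dim=1))

teacher = TinyMLLM().eval()                     # teacher sees all visual tokens
student = TinyMLLM()                            # student trains under compression
student.load_state_dict(teacher.state_dict())   # start from the same weights
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

total_steps = 1000
for step in range(total_steps):
    visual_tokens = torch.randn(8, 196, 64)      # dummy batch of patch tokens
    # Progressive schedule: compression gets harder as training proceeds
    # (100% of tokens kept at the start, 25% at the end).
    keep_ratio = 1.0 - 0.75 * (step / total_steps)
    with torch.no_grad():
        teacher_logits = teacher(visual_tokens, keep_ratio=1.0)
    student_logits = student(visual_tokens, keep_ratio=keep_ratio)
    # Consistency loss: match the compressed student to the full-token teacher.
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key ingredient this sketch tries to capture is the progressive trajectory: early steps present the student with a feature space close to the teacher's, and the token budget is tightened gradually so the parameter space never has to absorb a large perturbation at once. The paper's layer consistency distillation would analogously align intermediate-layer representations rather than only output logits.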