Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

Abstract

Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either by modifying model components or by introducing additional parameters. However, they often overlook the increased learning difficulty that such compression causes: the model's parameter space struggles to adapt quickly to the substantial perturbations that token compression induces in the feature space. In this work, we propose Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, we decompose the feature-space perturbations introduced by token compression along the token-wise and layer-wise dimensions and introduce token consistency distillation and layer consistency distillation, respectively. Both aim to reduce training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of the proposed framework.
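To make the idea of a progressive learning trajectory guided by a teacher more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the schedule, the loss, or how the teacher is configured, so everything here is an assumption made for illustration: the linear annealing in `keep_ratio`, the fixed `gap` that gives the teacher a slightly easier compression setting, and the use of a KL divergence over output logits as the consistency loss. All function and parameter names are hypothetical.

```python
import math


def keep_ratio(step, total_steps, final_ratio):
    """Assumed schedule: linearly anneal the fraction of visual tokens kept,
    from 1.0 (no compression) down to final_ratio."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress * (1.0 - final_ratio)


def teacher_student_ratios(step, total_steps, final_ratio, gap=0.1):
    """Assumed pairing: the teacher keeps slightly more tokens than the
    student, so the student only has to bridge a small perturbation."""
    student = keep_ratio(step, total_steps, final_ratio)
    teacher = min(1.0, student + gap)
    return teacher, student


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over the two output distributions,
    used here as a stand-in consistency loss."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    for step in (0, 50, 100):
        t, s = teacher_student_ratios(step, total_steps=100, final_ratio=0.2)
        print(f"step={step} teacher_keep={t:.2f} student_keep={s:.2f}")
    print(kl_divergence([2.0, 1.0, 0.1], [1.5, 1.2, 0.3]))
```

Under these assumptions the student faces stronger compression as training proceeds, while the teacher always stays one small step easier. A layer-wise variant could anneal the layer at which compression is applied instead of the keep ratio, but the abstract gives no details on that schedule either.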
