Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

Abstract

Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly reducing their efficiency. Recent work has tried to improve efficiency by compressing visual tokens during training, either by modifying model components or by introducing additional parameters. However, these approaches often overlook the increased learning difficulty that such compression causes: the model's parameter space struggles to adapt quickly to the substantial perturbations that token compression induces in the feature space. In this work, we propose Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, we decompose the feature-space perturbations introduced by token compression along the token-wise and layer-wise dimensions and introduce token consistency distillation and layer consistency distillation, respectively. Both aim to reduce training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of the proposed framework.
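The abstract does not state the loss function or the compression schedule, so the following is only a minimal sketch of how a progressive consistency objective of this kind could look. It assumes a linear ramp of the token compression ratio over training, a teacher that sees a slightly less compressed input than the student, and a KL divergence between their output distributions. All function names, the `gap` and `max_ratio` parameters, and these design choices are hypothetical illustrations, not the authors' implementation.

```python
import math


def compression_ratio(step, total_steps, max_ratio=0.9):
    """Assumed schedule: linearly ramp the fraction of visual tokens dropped
    from 0 to max_ratio as training progresses."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max_ratio * progress


def teacher_student_ratios(step, total_steps, gap=0.1, max_ratio=0.9):
    """Assumed pairing: the teacher compresses slightly less than the student,
    so each distillation step bridges only a small perturbation."""
    student = compression_ratio(step, total_steps, max_ratio)
    teacher = max(student - gap, 0.0)
    return teacher, student


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(teacher_logits, student_logits, eps=1e-12):
    """KL(teacher || student) between output distributions (assumed loss)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


if __name__ == "__main__":
    for step in (0, 500, 1000):
        print(step, teacher_student_ratios(step, total_steps=1000))
    print(consistency_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8]))
```

In this sketch the student is never asked to match a teacher that operates far from its own compression level, which is one way to read the "progressive learning trajectory" described above. A layer-wise variant could apply the same idea to the depth at which compression is performed instead of the ratio; that is likewise an assumption rather than something the abstract specifies.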
