MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation
July 9, 2025
Authors: Hui Li, Pengfei Yang, Juanyang Chen, Le Dong, Yanxin Chen, Quan Wang
cs.AI
Abstract
Knowledge distillation, as an efficient knowledge transfer technique, has
achieved remarkable success in unimodal scenarios. However, in cross-modal
settings, conventional distillation methods encounter significant challenges
due to data and statistical heterogeneities, failing to leverage the
complementary prior knowledge embedded in cross-modal teacher models. This
paper empirically reveals two critical issues in existing approaches:
distillation path selection and knowledge drift. To address these limitations,
we propose MST-Distill, a novel cross-modal knowledge distillation framework
featuring a mixture of specialized teachers. Our approach employs a diverse
ensemble of teacher models across both cross-modal and multimodal
configurations, integrated with an instance-level routing network that
facilitates adaptive and dynamic distillation. This architecture effectively
transcends the constraints of traditional methods that rely on a single,
static teacher model. Additionally, we introduce a plug-in masking module,
independently trained to suppress modality-specific discrepancies and
reconstruct teacher representations, thereby mitigating knowledge drift and
enhancing transfer effectiveness. Extensive experiments across five diverse
multimodal datasets, spanning visual, audio, and text, demonstrate that our
method significantly outperforms existing state-of-the-art knowledge
distillation methods in cross-modal distillation tasks. The source code is
available at https://github.com/Gray-OREO/MST-Distill.
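To make the routing idea in the abstract more concrete, below is a minimal PyTorch-style sketch of instance-level routing over a pool of specialized teachers: a small routing network produces per-sample weights over the teachers' soft targets, and the student is distilled against the routed mixture. All names (TeacherRoutingDistiller, router, feat, etc.) are hypothetical illustrations under assumed shapes, not the authors' implementation; the actual MST-Distill architecture, including the plug-in masking module, is defined in the paper and the linked repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherRoutingDistiller(nn.Module):
    # Hypothetical sketch: per-instance routing over a pool of pretrained,
    # frozen cross-modal/multimodal teachers, weighting their soft targets
    # to form the student's distillation target.
    def __init__(self, student, teachers, feat_dim, temperature=4.0):
        super().__init__()
        self.student = student
        self.teachers = nn.ModuleList(teachers)   # assumed pretrained and frozen
        self.router = nn.Sequential(              # instance-level routing network
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, len(teachers)),
        )
        self.T = temperature

    def forward(self, x_student, x_teachers, feat):
        # Student prediction on its own modality.
        s_logits = self.student(x_student)

        # Soft targets from every specialized teacher (no gradients).
        with torch.no_grad():
            t_logits = torch.stack(
                [t(x) for t, x in zip(self.teachers, x_teachers)], dim=1
            )  # (batch, num_teachers, num_classes)

        # Per-instance weights over teachers, conditioned on an input feature.
        w = F.softmax(self.router(feat), dim=-1)          # (batch, num_teachers)
        t_soft = F.softmax(t_logits / self.T, dim=-1)
        mixed = (w.unsqueeze(-1) * t_soft).sum(dim=1)     # routed mixture of teachers

        # KL distillation loss of the student against the routed mixture.
        kd_loss = F.kl_div(
            F.log_softmax(s_logits / self.T, dim=-1), mixed,
            reduction="batchmean",
        ) * (self.T ** 2)
        return s_logits, kd_loss

In practice this KD term would be combined with the task loss on ground-truth labels, and the routing weights would be learned jointly with (or ahead of) the student; how MST-Distill trains its router and masking modules follows the procedure in the paper, not this sketch.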