
Understanding and Harnessing Sparsity in Unified Multimodal Models

December 2, 2025
Authors: Shwai He, Chaorui Deng, Ang Li, Shen Yan
cs.AI

Abstract

Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at https://github.com/Shwai-He/SparseUnifiedModel.
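
To make the probing methodology concrete, here is a minimal, hedged sketch of training-free pruning in PyTorch. The function names (`depth_prune`, `width_prune_ffn`) and the assumption that the model exposes its transformer blocks as `model.blocks` are illustrative, not the authors' released code: depth pruning drops whole residual blocks, while width reduction jointly shrinks an FFN's hidden channels by magnitude so the layer shapes stay consistent.

```python
# Hypothetical sketch of training-free pruning as a probe (not the paper's code).
import copy
import torch
import torch.nn as nn


def depth_prune(model: nn.Module, keep_every: int = 2) -> nn.Module:
    """Depth pruning: keep every `keep_every`-th transformer block.

    Assumes `model.blocks` is an nn.ModuleList of residual blocks.
    """
    pruned = copy.deepcopy(model)
    pruned.blocks = nn.ModuleList(
        blk for i, blk in enumerate(pruned.blocks) if i % keep_every == 0
    )
    return pruned


def width_prune_ffn(ffn_in: nn.Linear, ffn_out: nn.Linear, keep_ratio: float = 0.5):
    """Width reduction: drop the lowest-magnitude hidden channels of an FFN.

    The up-projection loses output channels and the down-projection loses the
    matching input channels, so the block remains shape-consistent.
    """
    norms = ffn_in.weight.norm(dim=1)                    # one score per hidden channel
    k = max(1, int(keep_ratio * ffn_in.out_features))
    keep = torch.topk(norms, k).indices.sort().values    # channels to keep

    new_in = nn.Linear(ffn_in.in_features, k, bias=ffn_in.bias is not None)
    new_in.weight.data = ffn_in.weight.data[keep].clone()
    if ffn_in.bias is not None:
        new_in.bias.data = ffn_in.bias.data[keep].clone()

    new_out = nn.Linear(k, ffn_out.out_features, bias=ffn_out.bias is not None)
    new_out.weight.data = ffn_out.weight.data[:, keep].clone()
    if ffn_out.bias is not None:
        new_out.bias.data = ffn_out.bias.data.clone()
    return new_in, new_out
```

Applying such pruning at increasing ratios, without any retraining, and measuring the resulting understanding and generation scores is what reveals which components are compressible and which degrade sharply.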
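
The MoE Adaptation can likewise be illustrated with a small sketch. The module below (all names and shapes are assumptions for illustration, not the adapted BAGEL implementation) splits a generation FFN into several experts and routes each token to only its top-k experts, so roughly k/num_experts of the generation parameters are active per token, matching the "about half of its parameters" regime described in the abstract.

```python
# Hypothetical sparse-MoE adaptation of a generation FFN (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert covers an equal slice of the original FFN width.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden // num_experts),
                nn.GELU(),
                nn.Linear(d_hidden // num_experts, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Route each token to its top-k experts.
        scores = self.router(x)                              # (B, S, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)       # (B, S, top_k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Under the expert-frozen tuning described in the abstract, the expert weights would stay fixed while the router (and any lightweight adapters) are trained; the fully trainable variant then updates the experts as well for additional gains.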