DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

April 23, 2025
Authors: Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu
cs.AI

Abstract

We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the token sequence that large language models (LLMs) expect by efficiently reconstructing the attention dynamics of a full sequence, thus preserving downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically adapts token compression to the content of the image and is completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving performance comparable to full-length models across diverse VLM architectures, including the recently popularized AnyRes-based visual encoders. Furthermore, through qualitative analyses, we demonstrate that DToMe effectively adapts token reduction to image complexity and, unlike existing systems, gives users more control over computational costs. Project page: https://mikewangwzhl.github.io/dymu/.
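
To make the idea of content-dependent token reduction concrete, here is a minimal sketch. It is not the DToMe algorithm from the paper: it assumes a simple greedy rule (merge a token into the first group whose mean has cosine similarity above a fixed threshold), and every name in it (`merge_similar_tokens`, `expand`, `threshold`) is hypothetical. It only illustrates that a similarity threshold, rather than a fixed output length, makes the token count depend on the image content.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_similar_tokens(tokens, threshold=0.9):
    """Greedily merge each token into the first existing group whose mean
    has cosine similarity >= threshold; otherwise start a new group.

    Returns (merged, assignment): merged[g] is the mean embedding of group g,
    and assignment[i] is the group index of original token i.
    This is an illustrative rule, not the paper's merging procedure.
    """
    sums, counts, assignment = [], [], []
    for tok in tokens:
        target = None
        for g, (s, c) in enumerate(zip(sums, counts)):
            mean = [v / c for v in s]
            if cosine(tok, mean) >= threshold:
                target = g
                break
        if target is None:
            # No sufficiently similar group: this token starts its own group.
            sums.append(list(tok))
            counts.append(1)
            target = len(sums) - 1
        else:
            # Accumulate into the matching group (mean is sums / counts).
            sums[target] = [a + b for a, b in zip(sums[target], tok)]
            counts[target] += 1
        assignment.append(target)
    merged = [[v / c for v in s] for s, c in zip(sums, counts)]
    return merged, assignment

def expand(merged, assignment):
    """Naive unmerge: copy each group's embedding back to every original
    position. This is NOT VTU; it only shows the full-length sequence."""
    return [merged[g] for g in assignment]

# A "simple" image: many near-identical patches collapse to few tokens.
simple = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 2
# A "complex" image: patches point in different directions, so none merge.
complex_ = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0],
            [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]

simple_merged, simple_assign = merge_similar_tokens(simple)
complex_merged, _ = merge_similar_tokens(complex_)
print(len(simple_merged), len(complex_merged))  # prints "2 8"
```

Both inputs start with eight tokens, yet the uniform one shrinks to two and the varied one keeps all eight. The `expand` helper shows the full-length sequence that a downstream LLM would expect; the abstract describes VTU as reconstructing the attention dynamics of that full sequence efficiently, which this naive copy does not attempt.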