Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
May 21, 2025
Authors: Penghao Wu, Lewei Lu, Ziwei Liu
cs.AI
Abstract
Large multimodal models excel in multimodal tasks but face significant
computational challenges due to excessive computation on visual tokens. Unlike
token reduction methods that focus on token-level redundancy, we identify and
study the computation-level redundancy on vision tokens to ensure no
information loss. Our key insight is that vision tokens from the pretrained
vision encoder do not necessarily require all the heavy operations (e.g.,
self-attention, FFNs) in decoder-only LMMs and could be processed more lightly
with proper designs. We design a series of experiments to discover and
progressively squeeze out the vision-related computation redundancy. Based on
our findings, we propose ProxyV, a novel approach that utilizes proxy vision
tokens to alleviate the computational burden on original vision tokens. ProxyV
enhances efficiency without compromising performance and can even yield notable
performance gains in scenarios with more moderate efficiency improvements.
Furthermore, the flexibility of ProxyV is demonstrated through its combination
with token reduction methods to boost efficiency further. The code will be made
public at https://github.com/penghao-wu/ProxyV.
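To make the core idea concrete, here is a minimal NumPy sketch of one decoder layer in the spirit of ProxyV. It is an illustrative simplification, not the paper's implementation: the pooling factor, the single-head unmasked attention, and the light "guided update" MLP (`proxyv_layer`, `self_attention`) are all hypothetical stand-ins. The point it demonstrates is that the heavy operations run only over a small set of proxy vision tokens plus the text tokens, while the full set of vision tokens receives only a cheap per-token update conditioned on its proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d):
    # Toy single-head self-attention with identity Q/K/V projections
    # (no causal mask, for brevity).
    scores = softmax(x @ x.T / np.sqrt(d))
    return scores @ x

def proxyv_layer(vision, text, pool=4):
    """One layer, ProxyV-style (hypothetical simplification):
    heavy ops touch only [proxies; text]; original vision tokens
    get a light proxy-guided update instead of attention + full FFN."""
    n_v, d = vision.shape
    # 1) Proxy tokens: average-pool groups of `pool` vision tokens.
    proxies = vision.reshape(n_v // pool, pool, d).mean(axis=1)
    # 2) Heavy path: self-attention over proxies and text only.
    compact = np.concatenate([proxies, text], axis=0)
    compact = compact + self_attention(compact, d)
    updated_proxies = compact[: proxies.shape[0]]
    updated_text = compact[proxies.shape[0]:]
    # 3) Light path: each vision token is refined by a cheap linear map
    #    conditioned on its updated proxy (toy random weights here).
    guide = np.repeat(updated_proxies, pool, axis=0)
    W = rng.standard_normal((2 * d, d)) * 0.02
    vision = vision + np.concatenate([vision, guide], axis=1) @ W
    return vision, updated_text

# Usage: 16 vision tokens and 4 text tokens, hidden size 8.
v = rng.standard_normal((16, 8))
t = rng.standard_normal((4, 8))
v2, t2 = proxyv_layer(v, t, pool=4)
print(v2.shape, t2.shape)  # (16, 8) (4, 8)
```

In this sketch the quadratic attention cost drops from (16 + 4)^2 to (4 + 4)^2 token pairs per layer, while all 16 vision tokens are still updated, which is the sense in which no token-level information is discarded.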