Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM

Abstract

Large multimodal models (LMMs) excel at multimodal tasks but face significant computational challenges due to excessive computation on vision tokens. Unlike token reduction methods, which focus on token-level redundancy, we identify and study computation-level redundancy on vision tokens so that no information is lost. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We design a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that uses proxy vision tokens to alleviate the computational burden on the original vision tokens. ProxyV improves efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, ProxyV is flexible: it can be combined with token reduction methods to boost efficiency further. The code will be made public at https://github.com/penghao-wu/ProxyV.
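To give a feel for why computation on vision tokens dominates, here is a rough back-of-the-envelope sketch. It is not taken from the paper. It assumes a standard decoder layer with four d_model x d_model attention projections and a two-matrix FFN, ignores causal masking, normalization, and gated FFN variants, and uses hypothetical token counts and model sizes. The function names are illustrative only.

```python
def layer_macs(n_query: int, n_context: int, d_model: int, d_ff: int) -> int:
    """Approximate multiply-accumulates for one decoder layer.

    Counts only the work done for `n_query` tokens that attend over
    `n_context` tokens (assumption: full, non-causal attention).
    """
    projections = 4 * n_query * d_model * d_model   # Q, K, V, and output projections
    attention = 2 * n_query * n_context * d_model   # QK^T scores plus weighted sum over V
    ffn = 2 * n_query * d_model * d_ff              # up- and down-projection
    return projections + attention + ffn


def vision_share(n_vision: int, n_text: int, d_model: int, d_ff: int) -> float:
    """Fraction of one layer's approximate MACs spent on vision tokens."""
    n_total = n_vision + n_text
    vision = layer_macs(n_vision, n_total, d_model, d_ff)
    text = layer_macs(n_text, n_total, d_model, d_ff)
    return vision / (vision + text)


if __name__ == "__main__":
    # Hypothetical sizes: 2880 vision tokens, 100 text tokens,
    # hidden size 4096, FFN size 11008.
    share = vision_share(2880, 100, 4096, 11008)
    print(f"Approximate share of layer compute on vision tokens: {share:.3f}")
```

Under these simplifications every token costs the same per layer, so the share reduces to the fraction of vision tokens in the sequence. That is the motivation for processing vision tokens more lightly rather than only removing some of them.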
