DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

Abstract

Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet they remain computationally expensive because of dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only, redundancy-aware compression of the vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains more than 97% at 89% reduction. When this dual-stage compression is also applied during training, it achieves 99.7% accuracy at 67% reduction and 97.6% at 89% reduction, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline, achieving more than 100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% reduction setting. These results show that end-to-end training with DUET-VLM enables robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.
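
To make the two stages concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify how redundancy or text guidance is scored, so every rule here is an assumption chosen for illustration: stage (a) keeps the tokens with the highest given salience score and averages each remaining token into its most cosine-similar kept token, and stage (b) ranks visual tokens by their summed dot product with the text token vectors as a stand-in for attention. All function names and parameters are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress_vision_tokens(tokens, salience, keep):
    """Stage (a), assumed rule: keep the `keep` most salient tokens and
    average every other token into its most similar kept token."""
    order = sorted(range(len(tokens)), key=lambda i: salience[i], reverse=True)
    kept = sorted(order[:keep])  # preserve the original token order
    groups = {i: [tokens[i]] for i in kept}
    for j in order[keep:]:
        nearest = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
        groups[nearest].append(tokens[j])
    # Each output token is the element-wise mean of its group.
    return [[sum(col) / len(groups[i]) for col in zip(*groups[i])] for i in kept]


def drop_by_text_guidance(tokens, text_tokens, keep_ratio):
    """Stage (b), assumed rule: score each visual token by its summed dot
    product with the text tokens and keep the top `keep_ratio` share."""
    scores = [
        sum(sum(v * t for v, t in zip(vis, txt)) for txt in text_tokens)
        for vis in tokens
    ]
    keep = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


def duet_sketch(tokens, salience, text_tokens, stage_a_keep, layer_keep_ratios):
    """Run stage (a) once, then one stage (b) pruning step per listed layer.
    A real model would rescore tokens from each layer's own activations."""
    tokens = compress_vision_tokens(tokens, salience, stage_a_keep)
    for ratio in layer_keep_ratios:
        tokens = drop_by_text_guidance(tokens, text_tokens, ratio)
    return tokens


# Toy data: six 2-D "visual tokens" and one "text token" (illustrative values).
visual = [[1.0, 0.0], [1.0, 0.25], [0.0, 1.0], [0.25, 1.0], [1.0, 1.0], [-1.0, 0.0]]
salience = [0.9, 0.1, 0.8, 0.2, 0.7, 0.6]
text = [[-1.0, 1.0]]

compressed = compress_vision_tokens(visual, salience, keep=4)
pruned = duet_sketch(visual, salience, text, stage_a_keep=4, layer_keep_ratios=[0.5])
print(len(visual), len(compressed), len(pruned))  # 6 4 2
```

In this toy run the token count goes from 6 to 4 after the vision-only stage and to 2 after one text-guided pruning step. The point of the sketch is the ordering: redundancy is removed before the language backbone sees the tokens, and text-dependent pruning happens only afterwards, on the already compact set.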
