VoCo-LLaMA: Towards Vision Compression with Large Language Models

Abstract

Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and the high computational cost of processing high-resolution images and videos. Vision compression can alleviate this problem by reducing the number of vision tokens. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed tokens, which leads to visual information loss. As a result, the way LLMs themselves understand vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression (VoCo) tokens during the vision instruction tuning phase and leveraging attention distillation, our method distills how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision compression and improves computational efficiency during the inference stage. Specifically, our method achieves minimal performance loss at a compression ratio of 576×, resulting in up to 94.8% fewer FLOPs and a 69.6% acceleration in inference time. Furthermore, through continuous training on time-series sequences of compressed tokens from video frames, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of the VLM context window, enabling more scalable multi-modal applications. The project page, along with the associated code, can be accessed at https://yxxxb.github.io/VoCo-LLaMA-page/.
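The abstract does not spell out how the LLM is made to route visual information through the VoCo tokens, so the following is only a minimal sketch under stated assumptions. It assumes the sequence is laid out as vision tokens, then VoCo tokens, then text tokens, and that text tokens are blocked from attending to the raw vision tokens, so anything they learn about the image must pass through the VoCo tokens. The function names `build_voco_mask` and `compression_ratio`, and the count of 576 vision tokens per image, are hypothetical and chosen for illustration; the abstract only states the 576× ratio.

```python
def build_voco_mask(n_vision, n_voco, n_text):
    """Build a boolean attention mask for an assumed VoCo-style layout.

    Assumed layout: [vision tokens | VoCo tokens | text tokens].
    mask[q][k] is True when query position q may attend to key position k.
    """
    total = n_vision + n_voco + n_text
    text_start = n_vision + n_voco
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            allowed = k <= q  # standard causal constraint
            # Assumption: text tokens cannot see raw vision tokens,
            # so visual information must flow through the VoCo tokens.
            if q >= text_start and k < n_vision:
                allowed = False
            row.append(allowed)
        mask.append(row)
    return mask


def compression_ratio(n_vision, n_voco):
    """Ratio of original vision tokens to compressed VoCo tokens."""
    return n_vision / n_voco


if __name__ == "__main__":
    # Tiny example: 4 vision tokens, 1 VoCo token, 2 text tokens.
    for row in build_voco_mask(4, 1, 2):
        print("".join("1" if allowed else "0" for allowed in row))
    # Hypothetical token counts that would match the reported 576x ratio.
    print(compression_ratio(576, 1))
```

Under this assumed masking, only the VoCo tokens would need to be kept for the text tokens to attend to at inference time, which is one way the reported FLOPs and latency savings could arise.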
