VoCo-LLaMA: Towards Vision Compression with Large Language Models
June 18, 2024
Authors: Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
cs.AI
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in various
multi-modal tasks, but they are often bottlenecked by the limited context
window and high computational cost of processing high-resolution image inputs
and videos. Vision compression can alleviate this problem by reducing the
vision token count. Previous approaches compress vision tokens with external
modules and force LLMs to understand the compressed ones, leading to visual
information loss. However, how the LLMs themselves understand vision tokens is
not fully exploited in the compression learning process. We propose VoCo-LLaMA,
the first approach to compress vision tokens using LLMs. By introducing Vision
Compression tokens during the vision instruction tuning phase and leveraging
attention distillation, our method distills how LLMs comprehend vision tokens
into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision
compression and improves the computational efficiency during the inference
stage. Specifically, our method achieves minimal performance loss with a
compression ratio of 576×, resulting in up to 94.8% fewer FLOPs and
69.6% acceleration in inference time. Furthermore, through continued training
on time-series sequences of compressed tokens from video frames,
VoCo-LLaMA demonstrates the ability to understand temporal correlations,
outperforming previous methods on popular video question-answering benchmarks.
Our approach presents a promising way to unlock the full potential of VLMs'
contextual window, enabling more scalable multi-modal applications. The project
page, along with the associated code, can be accessed via
https://yxxxb.github.io/VoCo-LLaMA-page/.
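
The abstract describes the compression mechanism only at a high level. As a rough illustration of one plausible reading, the sketch below builds an attention mask for a sequence laid out as [vision tokens | VoCo tokens | text tokens], where text tokens are blocked from attending to the raw vision tokens and must therefore read visual content through the VoCo tokens. The function name, segment sizes, and mask convention are illustrative assumptions, not taken from the released VoCo-LLaMA code.

```python
import torch

def build_voco_attention_mask(n_vision: int, n_voco: int, n_text: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a sequence laid out as
    [vision tokens | VoCo tokens | text tokens].

    Standard causal masking applies, with one extra restriction: text tokens
    cannot attend to the raw vision tokens, so any visual information the text
    needs has to flow through the VoCo tokens.
    """
    total = n_vision + n_voco + n_text
    # Causal (lower-triangular) mask: position i may attend to positions <= i.
    mask = torch.tril(torch.ones(total, total)).bool()

    text_start = n_vision + n_voco
    # Cut the direct text -> vision path; text still sees VoCo and earlier text tokens.
    mask[text_start:, :n_vision] = False
    return mask

# Example: 576 vision tokens (a 24x24 ViT grid) compressed into a single VoCo token.
mask = build_voco_attention_mask(n_vision=576, n_voco=1, n_text=32)
assert not mask[577:, :576].any()  # text tokens never attend to raw vision tokens
assert mask[576, :576].all()       # the VoCo token attends to all vision tokens
```

Under such a mask, the vision-token activations (and their KV cache) could be discarded once the VoCo positions have been computed, which is consistent with the reported FLOPs reduction and inference speed-up.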