VoCo-LLaMA: Towards Vision Compression with Large Language Models
June 18, 2024
Authors: Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
cs.AI
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in various
multi-modal tasks, but they are often bottlenecked by the limited context
window and high computational cost of processing high-resolution image inputs
and videos. Vision compression can alleviate this problem by reducing the
vision token count. Previous approaches compress vision tokens with external
modules and force LLMs to understand the compressed ones, leading to visual
information loss. However, how the LLMs themselves understand vision tokens is
not fully exploited in the compression learning process. We propose VoCo-LLaMA,
the first approach to compress vision tokens using LLMs. By introducing Vision
Compression tokens during the vision instruction tuning phase and leveraging
attention distillation, our method distills how LLMs comprehend vision tokens
into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision
compression and improves the computational efficiency during the inference
stage. Specifically, our method achieves minimal performance loss with a
compression ratio of 576×, resulting in up to 94.8% fewer FLOPs and
69.6% acceleration in inference time. Furthermore, through continued training
on time-series sequences of compressed tokens from video frames,
VoCo-LLaMA demonstrates the ability to understand temporal correlations,
outperforming previous methods on popular video question-answering benchmarks.
Our approach presents a promising way to unlock the full potential of VLMs'
contextual window, enabling more scalable multi-modal applications. The project
page, along with the associated code, can be accessed via
https://yxxxb.github.io/VoCo-LLaMA-page/.
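
The abstract describes the compression mechanism only at a high level. As a rough illustration of one plausible reading, the sketch below builds an attention mask for a sequence laid out as [vision tokens | VoCo tokens | text tokens], where text tokens are blocked from attending to the raw vision tokens and must therefore read visual content through the VoCo tokens. The function name, segment sizes, and mask convention are illustrative assumptions, not taken from the released VoCo-LLaMA code.

```python
import torch

def build_voco_attention_mask(n_vision: int, n_voco: int, n_text: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a sequence laid out as
    [vision tokens | VoCo tokens | text tokens].

    Standard causal masking applies, with one extra restriction: text tokens
    cannot attend to the raw vision tokens, so any visual information the text
    needs has to flow through the VoCo tokens.
    """
    total = n_vision + n_voco + n_text
    # Causal (lower-triangular) mask: position i may attend to positions <= i.
    mask = torch.tril(torch.ones(total, total)).bool()

    text_start = n_vision + n_voco
    # Cut the direct text -> vision path; text still sees VoCo and earlier text tokens.
    mask[text_start:, :n_vision] = False
    return mask

# Example: 576 vision tokens (a 24x24 ViT grid) compressed into a single VoCo token.
mask = build_voco_attention_mask(n_vision=576, n_voco=1, n_text=32)
assert not mask[577:, :576].any()  # text tokens never attend to raw vision tokens
assert mask[576, :576].all()       # the VoCo token attends to all vision tokens
```

Under such a mask, the vision-token activations (and their KV cache) could be discarded once the VoCo positions have been computed, which is consistent with the reported FLOPs reduction and inference speed-up.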