VoCo-LLaMA: Towards Vision Compression with Large Language Models
June 18, 2024
Authors: Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
cs.AI
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in various
multi-modal tasks, but they are often bottlenecked by the limited context
window and high computational cost of processing high-resolution image inputs
and videos. Vision compression can alleviate this problem by reducing the
vision token count. Previous approaches compress vision tokens with external
modules and force LLMs to understand the compressed ones, leading to visual
information loss. However, the LLMs' understanding paradigm of vision tokens is
not fully utilised in the compression learning process. We propose VoCo-LLaMA,
the first approach to compress vision tokens using LLMs. By introducing Vision
Compression tokens during the vision instruction tuning phase and leveraging
attention distillation, our method distills how LLMs comprehend vision tokens
into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision
compression and improves the computational efficiency during the inference
stage. Specifically, our method achieves minimal performance loss with a
compression ratio of 576×, resulting in up to 94.8% fewer FLOPs and
69.6% acceleration in inference time. Furthermore, through continuous
training using time-series compressed token sequences of video frames,
VoCo-LLaMA demonstrates the ability to understand temporal correlations,
outperforming previous methods on popular video question-answering benchmarks.
Our approach presents a promising way to unlock the full potential of VLMs'
contextual window, enabling more scalable multi-modal applications. The project
page, along with the associated code, can be accessed via
https://yxxxb.github.io/VoCo-LLaMA-page/.
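The compression idea in the abstract is easiest to picture as an attention-masking scheme over a sequence laid out as [vision tokens | VoCo tokens | text tokens]: the VoCo tokens can attend to the vision tokens, while the text tokens are cut off from the vision tokens and can only reach the visual content through the VoCo tokens. The sketch below illustrates that idea only; the function name, token counts, and exact mask layout are assumptions for illustration, not the released VoCo-LLaMA implementation.

```python
import torch

def build_voco_attention_mask(n_vision: int, n_voco: int, n_text: int) -> torch.Tensor:
    """Illustrative causal attention mask for a sequence laid out as
    [vision tokens | VoCo tokens | text tokens].

    - Vision tokens attend causally among themselves.
    - VoCo tokens attend to all preceding vision tokens (absorbing the visual content).
    - Text tokens are blocked from the original vision tokens, so visual
      information can only flow to them via the VoCo tokens.
    Returns a boolean mask where True means attention is allowed.
    """
    total = n_vision + n_voco + n_text
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))
    # Cut the direct path from text tokens back to the vision tokens.
    text_start = n_vision + n_voco
    mask[text_start:, :n_vision] = False
    return mask

# Example: 576 vision tokens compressed into a single VoCo token.
mask = build_voco_attention_mask(n_vision=576, n_voco=1, n_text=32)
print(mask.shape)             # torch.Size([609, 609])
print(mask[577, :576].any())  # False: text cannot see vision tokens directly
print(mask[576, :576].all())  # True: the VoCo token sees every vision token
```

Under such a layout, once the VoCo token's key/value states are cached at inference time, the 576 vision-token states could in principle be discarded, which is consistent with the FLOPs and latency savings the abstract reports.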