VoCo-LLaMA: Towards Vision Compression with Large Language Models

Abstract

Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and the high computational cost of processing high-resolution images and videos. Vision compression can alleviate this problem by reducing the number of vision tokens. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed tokens, which leads to visual information loss. As a result, the way LLMs themselves understand vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression (VoCo) tokens during the vision instruction tuning phase and leveraging attention distillation, our method distills how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision compression and improves computational efficiency during the inference stage. Specifically, our method achieves minimal performance loss at a compression ratio of 576×, with up to 94.8% fewer FLOPs and a 69.6% reduction in inference time. Furthermore, through continuous training on time-series sequences of compressed tokens from video frames, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of the VLM context window, enabling more scalable multi-modal applications. The project page, along with the associated code, is available at https://yxxxb.github.io/VoCo-LLaMA-page/.
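
The abstract does not spell out how the LLM is made to route visual information through the VoCo tokens. One plausible reading, shown here only as a minimal sketch, is an attention mask over a sequence laid out as [vision tokens | VoCo tokens | text tokens], where text positions are blocked from the raw vision tokens and can only see the VoCo tokens. The layout, the function names `voco_attention_mask` and `compression_ratio`, and the choice of 576 vision tokens compressed into a single VoCo token (which would match the reported 576× ratio) are all assumptions for illustration, not details taken from the abstract.

```python
def voco_attention_mask(n_vision, n_voco, n_text):
    """Return mask[q][k] == True when query position q may attend to key position k.

    Assumed sequence layout (hypothetical): [vision | VoCo | text].
    """
    total = n_vision + n_voco + n_text
    text_start = n_vision + n_voco
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            # Standard causal rule: a position sees itself and earlier positions.
            allowed = k <= q
            # Assumption: text queries cannot see raw vision tokens, so visual
            # information can reach them only through the VoCo tokens.
            if q >= text_start and k < n_vision:
                allowed = False
            row.append(allowed)
        mask.append(row)
    return mask


def compression_ratio(n_vision, n_voco):
    """Number of vision tokens represented by each VoCo token."""
    return n_vision / n_voco


# Tiny example: 4 vision tokens, 1 VoCo token, 2 text tokens.
for row in voco_attention_mask(4, 1, 2):
    print("".join("1" if allowed else "0" for allowed in row))

# Assumed setting matching the reported ratio: 576 vision tokens, 1 VoCo token.
print(compression_ratio(576, 1))
```

In the tiny example, the VoCo row (the fifth row) attends to all four vision tokens, while the two text rows have zeros in the vision columns and ones only from the VoCo column onward. Under this reading, the vision tokens could be dropped at inference time once the VoCo token's cached state has been computed, which is where a saving in FLOPs and latency would come from.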
