VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration

Abstract

Multimodal large language models (MLLMs) incur high computational costs because of the large number of visual tokens they process, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically target isolated components of the pipeline and often neglect textual alignment, which leads to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration that integrates two plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which performs context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
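The abstract does not specify how DVTS or TGVC score and merge tokens, so the following is only an illustrative sketch of the general select-then-merge pattern it describes, not the authors' implementation. The precomputed importance scores, the cosine-similarity criterion, the mean-based merging, and all function and parameter names are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_dominant(scores, keep):
    """Selection step (assumed): indices of the `keep` highest-scoring
    visual tokens, returned in their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def complement_with_text(tokens, kept, text_vec, extra):
    """Complement step (assumed): take the `extra` discarded tokens most
    similar to the text embedding and average each into its nearest kept token."""
    kept_set = set(kept)
    dropped = [i for i in range(len(tokens)) if i not in kept_set]
    dropped.sort(key=lambda i: cosine(tokens[i], text_vec), reverse=True)

    groups = {k: [tokens[k]] for k in kept}
    for i in dropped[:extra]:
        nearest = max(kept, key=lambda k: cosine(tokens[i], tokens[k]))
        groups[nearest].append(tokens[i])

    merged = []
    for k in kept:
        members = groups[k]
        dim = len(members[0])
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged


def compress(tokens, scores, text_vec, keep, extra):
    """Reduce len(tokens) visual tokens to `keep` tokens without any training."""
    kept = select_dominant(scores, keep)
    return complement_with_text(tokens, kept, text_vec, extra)


if __name__ == "__main__":
    # Toy data: four 2-D "visual tokens", hypothetical importance scores,
    # and a hypothetical text embedding.
    tokens = [[1, 0], [0, 1], [1, 1], [0.9, 0.1]]
    scores = [0.9, 0.1, 0.2, 0.8]
    text_vec = [0, 1]
    print(compress(tokens, scores, text_vec, keep=2, extra=1))
```

In this toy run, four tokens are reduced to two: the two highest-scoring tokens are kept, and the discarded token closest to the text embedding is folded into its nearest kept token rather than thrown away. A real system would operate on batched tensors and derive the scores from the vision encoder or language model, which this sketch does not attempt.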
