DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

Abstract

We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically adapts token compression to the content of the image and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures, including the recently popularized AnyRes-based visual encoders. Furthermore, through qualitative analyses, we demonstrate that DToMe effectively adapts token reduction based on image complexity and, unlike existing systems, provides users more control over computational costs. Project page: https://mikewangwzhl.github.io/dymu/.
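
To make the idea of content-dependent token counts concrete, here is a minimal sketch of threshold-based merging of similar token embeddings. It is an illustration under stated assumptions, not the authors' DToMe procedure: the abstract does not specify the algorithm, so the greedy grouping, the cosine-similarity criterion, the function name `merge_similar_tokens`, and the `threshold` parameter are all hypothetical choices made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_similar_tokens(tokens, threshold=0.9):
    """Greedily merge token embeddings whose cosine similarity is >= threshold.

    Hypothetical simplification for illustration only (not the paper's method).
    Each token joins the most similar existing group if that similarity reaches
    the threshold; otherwise it starts a new group. A group is represented by
    the running mean of its members.

    Returns (merged_tokens, group_sizes).
    """
    merged = []  # running mean embedding of each group
    sizes = []   # number of original tokens absorbed by each group
    for tok in tokens:
        best_i, best_sim = -1, threshold
        for i, mean in enumerate(merged):
            sim = cosine(tok, mean)
            if sim >= best_sim:
                best_i, best_sim = i, sim
        if best_i < 0:
            # No group is similar enough: keep this token as its own group.
            merged.append(list(tok))
            sizes.append(1)
        else:
            # Fold the token into the group's running mean.
            n = sizes[best_i]
            merged[best_i] = [(m * n + x) / (n + 1) for m, x in zip(merged[best_i], tok)]
            sizes[best_i] = n + 1
    return merged, sizes


# A "simple" input (identical tokens) collapses to one token, while a
# "complex" input (mutually dissimilar tokens) keeps all of its tokens.
simple_tokens = [[1.0, 0.0]] * 4
complex_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
simple_merged, simple_sizes = merge_similar_tokens(simple_tokens)
complex_merged, complex_sizes = merge_similar_tokens(complex_tokens)
```

With a fixed threshold, the output length depends on how redundant the input is, which is the property the abstract attributes to DToMe. In this sketch, raising or lowering `threshold` is the knob that trades token count against fidelity.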
