A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models

Abstract

Visual token compression is critical for Large Vision-Language Models (LVLMs) to process high-resolution inputs efficiently. Existing methods typically adopt a fixed compression ratio and therefore cannot adapt to scenes of varying complexity. This often leads to imprecise pruning that discards informative visual tokens and degrades model performance. To address this issue, we introduce GlimpsePrune, a dynamic pruning framework inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while, on average, fully retaining the baseline performance on free-form Visual Question Answering (VQA) tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced variant, GlimpsePrune+, achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work opens a new path toward building more powerful and efficient LVLMs.
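The contrast between a fixed compression ratio and dynamic pruning can be illustrated with a minimal sketch. The abstract does not say how GlimpsePrune scores or selects tokens, so everything below is an assumption made for illustration: the function names, the per-token relevance scores, and the simple threshold rule are hypothetical and are not the authors' method.

```python
from typing import List


def prune_fixed_ratio(scores: List[float], keep_ratio: float) -> List[int]:
    """Keep the top `keep_ratio` fraction of tokens, whatever the image contains."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_dynamic(scores: List[float], threshold: float) -> List[int]:
    """Keep every token whose relevance score reaches `threshold`.

    The number of kept tokens varies with the input. The threshold rule is
    an illustrative stand-in, not the selection rule used by GlimpsePrune.
    """
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    if not kept:
        # Fall back to the single best token so the input is never empty.
        kept = [max(range(len(scores)), key=lambda i: scores[i])]
    return kept


# Hypothetical per-token relevance scores for two images.
simple_scene = [0.9, 0.05, 0.02, 0.01, 0.01, 0.01, 0.0, 0.0]
complex_scene = [0.7, 0.6, 0.65, 0.1, 0.55, 0.05, 0.6, 0.02]

for name, scores in [("simple", simple_scene), ("complex", complex_scene)]:
    fixed = prune_fixed_ratio(scores, keep_ratio=0.25)
    dynamic = prune_dynamic(scores, threshold=0.5)
    print(name, "fixed:", fixed, "dynamic:", dynamic)
```

With these made-up scores, the fixed ratio keeps two tokens for both images: it retains one near-irrelevant token in the simple scene and drops three relevant ones in the complex scene. The dynamic rule keeps one token and five tokens respectively, which is the kind of adaptation to scene complexity the abstract describes.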
