TokenPacker: Efficient Visual Projector for Multimodal LLM

Abstract

The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP (Multi-Layer Perceptron) that preserves all visual context through a one-to-one transformation. However, visual tokens are redundant, and their number grows considerably for high-resolution images, which significantly impairs the efficiency of MLLMs. Some recent works have introduced a resampler or an abstractor to reduce the number of resulting visual tokens. Unfortunately, these methods fail to capture finer details and undermine the visual reasoning capabilities of MLLMs. In this work, we propose a novel visual projector that adopts a coarse-to-fine scheme to inject enriched characteristics and generate condensed visual tokens. Specifically, we first interpolate the visual features into low-resolution point queries, which provide the overall visual representation as the foundation. Then, we introduce a region-to-point injection module that uses high-resolution, multi-level region-based cues as fine-grained reference keys and values, allowing them to be fully absorbed within the corresponding local context region. This step updates the coarse point queries, transforming them into enriched ones for subsequent LLM reasoning. Extensive experiments demonstrate that our approach compresses the visual tokens by 75% to 89% while achieving comparable or even better performance across diverse benchmarks with significantly higher efficiency. The source code is available at https://github.com/CircleRadon/TokenPacker.
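To make the coarse-to-fine idea concrete, the following is a minimal NumPy sketch of region-to-point injection, not the authors' implementation. It rests on several assumptions that the abstract does not state: the function name `region_to_point_injection` is hypothetical, average pooling stands in for the interpolation step, only a single feature level is used instead of multi-level cues, and the learned projections of a real attention layer are omitted. The 24x24 grid (576 tokens) is likewise an assumed example size.

```python
import numpy as np


def region_to_point_injection(features, scale):
    """Condense an (H, W, C) feature map into (H//scale * W//scale, C) tokens.

    Hypothetical sketch: each coarse point query attends only to the
    scale x scale block of high-resolution tokens it was pooled from.
    """
    H, W, C = features.shape
    assert H % scale == 0 and W % scale == 0, "grid must be divisible by scale"
    h, w = H // scale, W // scale

    # Group the high-resolution map into h*w local regions of scale*scale tokens.
    regions = (
        features.reshape(h, scale, w, scale, C)
        .transpose(0, 2, 1, 3, 4)
        .reshape(h * w, scale * scale, C)
    )

    # Coarse point queries. Assumption: average pooling replaces interpolation.
    queries = regions.mean(axis=1)  # (h*w, C)

    # Region-restricted attention: keys and values are the region's own tokens.
    scores = np.einsum("nc,nkc->nk", queries, regions) / np.sqrt(C)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    injected = np.einsum("nk,nkc->nc", weights, regions)

    # Residual update turns the coarse queries into enriched ones.
    return queries + injected


# Assumed example: a 24x24 grid of 8-dimensional features (576 tokens).
feats = np.random.default_rng(0).standard_normal((24, 24, 8))
for s in (2, 3):
    tokens = region_to_point_injection(feats, s)
    reduction = 1 - tokens.shape[0] / (feats.shape[0] * feats.shape[1])
    print(f"scale={s}: {tokens.shape[0]} tokens, {reduction:.1%} fewer")
# scale=2: 144 tokens, 75.0% fewer
# scale=3: 64 tokens, 88.9% fewer
```

Under these assumptions, downsampling each spatial dimension by a factor of 2 or 3 keeps 1/4 or 1/9 of the tokens, which matches the 75% to 89% compression range quoted in the abstract.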

