When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

Abstract

Vision encoders typically generate a large number of visual tokens, providing information-rich representations but significantly increasing computational demands. This raises the question of whether all generated tokens are equally valuable, or whether some of them can be discarded to reduce computational cost without compromising quality. In this paper, we introduce a new method for determining feature utility, based on the idea that less valuable features can be reconstructed from more valuable ones. We implement this concept by integrating an autoencoder with a Gumbel-Softmax selection mechanism, which makes it possible to identify and retain only the most informative visual tokens. To validate our approach, we compared the performance of the LLaVA-NeXT model when given features selected by our method against its performance with randomly selected features. We found that on OCR-based tasks, more than 50% of the visual context can be removed with minimal performance loss, whereas randomly discarding the same proportion of features significantly degrades the model's capabilities. Furthermore, on general-domain tasks, even randomly retaining only 30% of the tokens achieves performance comparable to using the full set of visual tokens. Our results highlight a promising direction towards adaptive and efficient multimodal pruning that facilitates scalable, low-overhead inference without compromising performance.
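To make the selection step concrete, the following is a minimal, dependency-free sketch of a binary Gumbel-Softmax keep/drop decision per token, together with the random-retention baseline the abstract compares against. It is not the authors' implementation: the abstract does not specify the architecture, so the function names, the per-token `keep_logits`, and the choice to fix the drop logit at 0 are all assumptions made for illustration. In the described method the selector would be trained jointly with an autoencoder that reconstructs the dropped features, and that training is not shown here.

```python
import math
import random


def sample_gumbel(rng: random.Random) -> float:
    """Draw Gumbel(0, 1) noise as -log(-log(U)), with U ~ Uniform(0, 1)."""
    # Clamp U away from 0 and 1 so both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_keep_probs(keep_logits, temperature, rng):
    """Relaxed probability of keeping each token (binary Gumbel-Softmax).

    Each token has two classes, keep and drop. The drop logit is fixed at 0
    (an assumption of this sketch), so the two-class softmax reduces to a
    sigmoid of the difference between the two perturbed, scaled logits.
    """
    probs = []
    for logit in keep_logits:
        keep = (logit + sample_gumbel(rng)) / temperature
        drop = sample_gumbel(rng) / temperature
        diff = keep - drop
        # Numerically stable sigmoid(diff).
        if diff >= 0:
            probs.append(1.0 / (1.0 + math.exp(-diff)))
        else:
            e = math.exp(diff)
            probs.append(e / (1.0 + e))
    return probs


def hard_keep_mask(probs):
    """Discretize relaxed probabilities into a keep (True) / drop (False) mask."""
    return [p > 0.5 for p in probs]


def random_keep_mask(num_tokens, keep_ratio, rng):
    """Baseline: keep a uniformly random subset of round(keep_ratio * num_tokens) tokens."""
    k = round(keep_ratio * num_tokens)
    kept = set(rng.sample(range(num_tokens), k))
    return [i in kept for i in range(num_tokens)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical selector scores for six visual tokens.
    logits = [4.0, -3.0, 0.5, 2.0, -1.0, -4.0]
    probs = gumbel_softmax_keep_probs(logits, temperature=0.5, rng=rng)
    print("learned-style mask:", hard_keep_mask(probs))
    print("random 30% mask:   ", random_keep_mask(len(logits), 0.3, rng))
```

Lower temperatures push the relaxed probabilities towards 0 or 1, so the soft selection behaves more like the hard mask. In a differentiable framework the relaxed probabilities would carry the gradients during training, while the hard mask would decide which tokens are actually passed on at inference time.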
