MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

Abstract

Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content under language instruction by converting visual input into vision tokens. However, redundancy among vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them use only unimodal information (i.e., vision or text) for pruning and ignore the inherently multimodal nature of vision-language tasks. Moreover, a generic criterion that can be applied to different modalities is lacking. To mitigate these limitations, we propose to leverage both vision and text tokens to select informative vision tokens using coverage as the criterion. We first formulate the subset selection problem as a maximum coverage problem. A subset of vision tokens is then optimized to cover the text tokens and the original set of vision tokens simultaneously. Finally, a VLM agent can be adopted to further improve the quality of the text tokens that guide vision pruning. The proposed method, MMTok, is extensively evaluated on benchmark datasets with different VLMs. The comparison shows that vision and text information are complementary, and that combining multimodal information surpasses the unimodal baselines by a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection.
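To make the maximum coverage formulation more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a facility-location-style objective, in which each target token (text or vision) is "covered" by its most similar selected vision token, and it maximizes that objective with the standard greedy heuristic. The function names, the `alpha` weight that balances text and vision coverage, and the precomputed non-negative similarity matrices are all hypothetical and do not come from the abstract.

```python
def greedy_coverage(sim_rows, k):
    """Greedily pick up to k candidate indices to maximize
    sum over targets t of max over selected s of sim_rows[t][s].

    sim_rows[t][s] is an assumed non-negative similarity between
    target token t and candidate vision token s.
    """
    n_candidates = len(sim_rows[0])
    best = [0.0] * len(sim_rows)  # current coverage of each target
    selected = []
    for _ in range(min(k, n_candidates)):
        top_gain, top_idx = -1.0, None
        for s in range(n_candidates):
            if s in selected:
                continue
            # Marginal gain: how much adding s improves each target's coverage.
            gain = sum(max(row[s] - b, 0.0) for row, b in zip(sim_rows, best))
            if gain > top_gain:
                top_gain, top_idx = gain, s
        selected.append(top_idx)
        best = [max(b, row[top_idx]) for row, b in zip(sim_rows, best)]
    return selected


def select_vision_tokens(text_to_vision, vision_to_vision, k, alpha=0.5):
    """Cover text tokens and the original vision tokens at once by
    stacking the two similarity matrices with weights alpha and 1 - alpha.
    (The weighting scheme is an assumption made for this illustration.)
    """
    rows = [[alpha * v for v in row] for row in text_to_vision]
    rows += [[(1.0 - alpha) * v for v in row] for row in vision_to_vision]
    return greedy_coverage(rows, k)
```

Stacking the two matrices keeps a single selection loop for both modalities, which reflects the idea of one criterion shared across modalities. The greedy loop is a common choice for coverage-style objectives, but the actual optimizer used by MMTok may differ.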
