Matryoshka Multimodal Models

Abstract

Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed an image into a fixed, large number of visual tokens and then feed those tokens into a Large Language Model (LLM). However, this design produces an excessive number of tokens in dense visual scenarios such as high-resolution images and videos, which makes inference highly inefficient. Token pruning and merging methods exist, but they produce a single output length for each image and offer no flexibility in trading off information density against efficiency. Inspired by Matryoshka dolls, we propose M3 (Matryoshka Multimodal Models), which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) the visual granularity can be controlled explicitly for each test instance during inference, e.g., by adjusting the number of tokens used to represent an image according to the anticipated complexity or simplicity of its content; (2) M3 provides a framework for analyzing the granularity that existing datasets require, and we find that COCO-style benchmarks need only around 9 visual tokens to reach accuracy similar to that obtained with all 576 tokens; (3) our approach provides a foundation for exploring the best trade-off between performance and visual token length at the sample level, and our investigation reveals a large gap between the oracle upper bound and current fixed-scale representations.
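
The idea of nested coarse-to-fine token sets can be illustrated with a minimal sketch. The abstract does not say how the nested sets are constructed, so everything here is an assumption made for illustration: the 576 tokens are treated as a 24 x 24 grid, coarser sets are obtained by average pooling, and the scale list (which yields 1, 9, 36, 144, and 576 tokens) as well as the helper names `average_pool` and `nested_token_sets` are hypothetical.

```python
def average_pool(grid_tokens, out_side):
    """Average-pool a square grid of token vectors down to out_side x out_side."""
    side = len(grid_tokens)
    if side % out_side != 0:
        raise ValueError("out_side must divide the grid side length")
    k = side // out_side                      # pooling window is k x k
    dim = len(grid_tokens[0][0])
    pooled = []
    for i in range(out_side):
        row = []
        for j in range(out_side):
            acc = [0.0] * dim
            for a in range(k):
                for b in range(k):
                    vec = grid_tokens[i * k + a][j * k + b]
                    for c in range(dim):
                        acc[c] += vec[c]
            row.append([x / (k * k) for x in acc])
        pooled.append(row)
    return pooled


def nested_token_sets(tokens, side=24, scales=(1, 3, 6, 12, 24)):
    """Map token count -> token list, from coarsest to finest (assumed scheme)."""
    if len(tokens) != side * side:
        raise ValueError("expected side * side tokens")
    # Rebuild the 2D layout from the flat, row-major token sequence.
    grid = [tokens[r * side:(r + 1) * side] for r in range(side)]
    sets = {}
    for s in scales:
        pooled = average_pool(grid, s)
        sets[s * s] = [vec for row in pooled for vec in row]
    return sets


if __name__ == "__main__":
    # Toy 2-dimensional "tokens"; a real encoder would emit much wider vectors.
    toy_tokens = [[float(i), 1.0] for i in range(576)]
    toy_sets = nested_token_sets(toy_tokens)
    print(sorted(toy_sets))  # available token budgets
```

At inference time, a caller would pick one entry of the returned dictionary (for example, the 9-token set for a simple image) and pass only those tokens to the language model; how that choice is made per sample is exactly the trade-off the abstract describes as open.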