Matryoshka Multimodal Models

Abstract

Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed each image into a large, fixed number of visual tokens and then feed those tokens into a Large Language Model (LLM). However, this design produces an excessive number of tokens in dense visual scenarios such as high-resolution images and videos, leading to substantial inefficiency. Token pruning and merging methods exist, but they produce a single-length output for each image and offer no flexibility in trading off information density against efficiency. Inspired by Matryoshka dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) one can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks need only about 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) our approach provides a foundation for exploring the best trade-off between performance and visual token length at the sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.
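To make the idea of nested coarse-to-fine token sets concrete, the following is a minimal sketch, not the authors' implementation. The abstract states only the token counts 576 and about 9. Everything else here is an assumption made for illustration: that the 576 tokens form a 24 x 24 grid, that coarser sets are obtained by average pooling, the intermediate scales (144, 36, 1), and the hypothetical helper names `average_pool`, `nested_token_sets`, and `pick_scale`.

```python
from typing import Dict, List, Sequence

Token = List[float]
Grid = List[List[Token]]  # grid[row][col] is one visual token's feature vector


def average_pool(grid: Grid, stride: int) -> Grid:
    """Average each non-overlapping stride x stride block of tokens into one token."""
    side = len(grid)  # assumes a square grid
    if side % stride != 0:
        raise ValueError("grid side must be divisible by stride")
    dim = len(grid[0][0])
    pooled: Grid = []
    for r in range(0, side, stride):
        row: List[Token] = []
        for c in range(0, side, stride):
            block = [grid[r + dr][c + dc]
                     for dr in range(stride) for dc in range(stride)]
            row.append([sum(tok[d] for tok in block) / len(block)
                        for d in range(dim)])
        pooled.append(row)
    return pooled


def flatten(grid: Grid) -> List[Token]:
    """Turn a 2D token grid into a flat token sequence in row-major order."""
    return [tok for row in grid for tok in row]


def nested_token_sets(grid: Grid,
                      strides: Sequence[int] = (2, 2, 2, 3)) -> Dict[int, List[Token]]:
    """Map token count -> token sequence, ordered from finest to coarsest.

    The default strides are an assumption: they take a 24 x 24 grid
    to 12 x 12, 6 x 6, 3 x 3, and finally 1 x 1.
    """
    scales = {len(grid) ** 2: flatten(grid)}
    current = grid
    for stride in strides:
        current = average_pool(current, stride)  # each scale is built from the finer one
        scales[len(current) ** 2] = flatten(current)
    return scales


def pick_scale(scales: Dict[int, List[Token]], budget: int) -> List[Token]:
    """Return the largest token set whose size does not exceed the budget."""
    fitting = [count for count in scales if count <= budget]
    if not fitting:
        raise ValueError("budget is smaller than the coarsest scale")
    return scales[max(fitting)]
```

Under these assumptions, a caller could pass a small budget for a simple image and a large one for a dense scene, which mirrors the per-instance control over visual granularity described in point (1).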