Matryoshka Multimodal Models

Abstract

Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed each image into a large, fixed number of visual tokens and then feed those tokens into a Large Language Model (LLM). However, this design produces an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, which makes inference highly inefficient. Token pruning and merging methods exist, but they produce a single-length output for each image and offer no flexibility in trading off information density against efficiency. Inspired by the concept of Matryoshka dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) one can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks need only around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) our approach provides a foundation for exploring the best trade-off between performance and visual token length at the sample level, where our investigation reveals a large gap between the oracle upper bound and current fixed-scale representations.
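
To make the idea of nested, coarse-to-fine visual token sets concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the 576 tokens form a 24×24 grid and that coarser sets are obtained by 2D average pooling; neither the pooling operator nor the intermediate scales (144, 36, 1) are stated in the abstract. The function name `nested_token_sets` and the embedding dimension of 8 are hypothetical choices for illustration.

```python
import numpy as np

def nested_token_sets(tokens, grid=24, scales=(24, 12, 6, 3, 1)):
    """Average-pool a (grid*grid, dim) token array into coarser square grids.

    Assumption: tokens are laid out row-major on a grid x grid map, and each
    scale s divides grid evenly. Returns {token_count: (token_count, dim) array}.
    """
    dim = tokens.shape[1]
    fmap = tokens.reshape(grid, grid, dim)
    out = {}
    for s in scales:
        k = grid // s  # side length of each pooling window
        # Split both spatial axes into (s windows, k cells) and average the cells.
        pooled = fmap.reshape(s, k, s, k, dim).mean(axis=(1, 3))
        out[s * s] = pooled.reshape(s * s, dim)
    return out

# Hypothetical input: 576 random tokens with an illustrative dimension of 8.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 8))
sets = nested_token_sets(tokens)
print(sorted(sets))  # available token counts, coarse to fine
```

Under these assumptions, each coarser set is an average of the finer one, so a caller could pick `sets[9]` for a simple image and `sets[576]` for a dense one at inference time, which is the kind of per-instance control over granularity that the abstract describes.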