
Matryoshka Multimodal Models

May 27, 2024
作者: Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee
cs.AI

Abstract

Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed, large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design causes an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to significant inefficiency. While token pruning/merging methods do exist, they produce a single-length output for each image and do not afford flexibility in trading off information density vs. efficiency. Inspired by the concept of Matryoshka dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) One can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) Our approach provides a foundation to explore the best trade-off between performance and visual token length at the sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.
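The nested coarse-to-fine token sets described above can be illustrated with a small sketch. The snippet below is a hypothetical simplification, not the paper's implementation: it assumes the 576 visual tokens form a 24×24 grid (as in LLaVA's CLIP encoder) and derives the nested scales 576 → 144 → 36 → 9 → 1 by repeated average pooling, so each coarser set is a summary of the finer one. The function name `matryoshka_token_scales` and the pooling choice are illustrative assumptions.

```python
import numpy as np

def matryoshka_token_scales(tokens: np.ndarray, grid: int = 24) -> dict:
    """Hypothetical sketch: build nested visual-token sets by average pooling.

    tokens: array of shape (grid*grid, d), e.g. 576 CLIP patch embeddings.
    Returns a dict mapping token count -> token array of shape (count, d),
    covering the M3-style scales {576, 144, 36, 9, 1} for grid=24.
    """
    d = tokens.shape[-1]
    x = tokens.reshape(grid, grid, d)
    scales = {grid * grid: x.reshape(-1, d)}
    side = grid
    while side > 1:
        # Pool 2x2 blocks while the side is even (24->12->6->3),
        # then collapse the remaining odd grid in one step (3->1).
        k = 2 if side % 2 == 0 else side
        side //= k
        x = x.reshape(side, k, side, k, d).mean(axis=(1, 3))
        scales[side * side] = x.reshape(-1, d)
    return scales

# At inference, one would pick a scale per image based on expected content
# complexity, e.g. scales[9] for a simple COCO-style photo.
scales = matryoshka_token_scales(np.random.rand(576, 64))
```

Because every pooling step averages equal-sized blocks, the single coarsest token equals the mean of all 576 fine tokens, which is what makes the representation genuinely nested rather than a set of independent encodings.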

