

Mixture of Nested Experts: Adaptive Processing of Visual Tokens

July 29, 2024
作者: Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul
cs.AI

Abstract

The visual medium (images and videos) naturally contains a large amount of information redundancy, thereby providing a great opportunity for improving processing efficiency. While Vision Transformer (ViT) based models scale effectively to large data regimes, they fail to capitalize on this inherent redundancy, leading to higher computational costs. Mixture of Experts (MoE) networks demonstrate scalability while maintaining the same inference-time costs, but they come with a larger parameter footprint. We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve. Given a compute budget, MoNE learns to dynamically choose tokens in a priority order, so that redundant tokens are processed through cheaper nested experts. Using this framework, we match the performance of the baseline models while reducing inference-time compute by over two-fold. We validate our approach on standard image and video datasets: ImageNet-21K, Kinetics400, and Something-Something-v2. We further highlight MoNE's adaptability by showcasing its ability to maintain strong performance across different inference-time compute budgets on videos, using only a single trained model.
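The routing idea the abstract describes — score tokens by importance, then send high-priority tokens to the full model and redundant ones to cheaper experts that are nested slices of it — can be sketched in a few lines. This is a toy illustration under assumptions, not the paper's actual algorithm: the function name `mone_route`, the linear router, the shared weight `W` whose top-left slices play the role of nested experts, and the fixed per-expert token capacities are all hypothetical simplifications for exposition.

```python
import numpy as np

def mone_route(tokens, W, router_w, expert_dims, capacities):
    """Toy sketch of priority-ordered routing to nested experts.

    tokens:      (n, d) token embeddings
    W:           (d, d) shared weight; expert k uses the slice W[:k, :k],
                 so cheaper experts are literally nested in the full one
    router_w:    (d,) linear router producing an importance score per token
    expert_dims: nested expert widths, most expensive first (e.g. [d, d//2])
    capacities:  how many tokens each expert may take, same order
    """
    n, d = tokens.shape
    scores = tokens @ router_w            # importance score per token
    order = np.argsort(-scores)           # priority order, highest first
    out = np.zeros_like(tokens)
    assignment = np.empty(n, dtype=int)   # which expert each token went to
    start = 0
    for e, (k, cap) in enumerate(zip(expert_dims, capacities)):
        idx = order[start:start + cap]    # next `cap` tokens by priority
        # nested expert e: project using only the top-left k x k slice of W,
        # leaving the remaining output dims zero for cheap experts
        out[idx, :k] = tokens[idx, :k] @ W[:k, :k]
        assignment[idx] = e
        start += cap
    return out, assignment
```

Under this sketch, the compute budget is encoded entirely by `capacities`: shifting tokens from the wide expert to the narrow one reduces FLOPs without retraining, which mirrors the adaptability the abstract claims for a single trained model.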

