Mixture of Nested Experts: Adaptive Processing of Visual Tokens
July 29, 2024
Authors: Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul
cs.AI
Abstract
The visual medium (images and videos) naturally contains a large amount of information redundancy, providing a great opportunity for improving processing efficiency. While Vision Transformer (ViT) based models scale effectively to large data regimes, they fail to capitalize on this inherent redundancy, leading to higher computational costs. Mixture of Experts (MoE) networks demonstrate scalability while maintaining the same inference-time cost, but they come with a larger parameter footprint. We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve. Given a compute budget, MoNE learns to dynamically choose tokens in a priority order, so that redundant tokens are processed through cheaper nested experts. Using this framework, we achieve performance equivalent to the baseline models while reducing inference-time compute by over two-fold. We validate our approach on standard image and video datasets: ImageNet-21K, Kinetics400, and Something-Something-v2. We further highlight MoNE's adaptability by showcasing its ability to maintain strong performance across different inference-time compute budgets on videos, using only a single trained model.
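To make the two ideas in the abstract concrete, here is a minimal, illustrative PyTorch sketch: nested experts realized as slices of one shared FFN (so the cheapest expert reuses a prefix of the full expert's weights), and capacity-constrained routing in which the largest expert claims its highest-scoring tokens first. All names (`NestedFFN`, `route_tokens`, `capacities`) and the greedy largest-expert-first assignment are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedFFN(nn.Module):
    """One shared FFN; the first dims[k] hidden units act as the k-th nested expert."""
    def __init__(self, d_model: int, d_hidden: int, dims: list):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)
        self.dims = dims  # e.g. [d_hidden // 4, d_hidden // 2, d_hidden]

    def forward_expert(self, x, k):
        d = self.dims[k]
        # Cheaper experts use fewer rows/columns of the shared weight matrices.
        h = F.gelu(F.linear(x, self.w_in.weight[:d], self.w_in.bias[:d]))
        return F.linear(h, self.w_out.weight[:, :d], self.w_out.bias)

def route_tokens(router_logits, capacities):
    """Greedy assignment: the largest expert picks its highest-scoring tokens
    first, then the next expert picks from the remainder, and so on."""
    n_tokens, n_experts = router_logits.shape
    assignment = torch.zeros(n_tokens, dtype=torch.long)
    taken = torch.zeros(n_tokens, dtype=torch.bool)
    for k in reversed(range(n_experts)):  # most expensive expert first
        budget = min(capacities[k], int((~taken).sum()))
        if budget == 0:
            continue
        scores = router_logits[:, k].masked_fill(taken, float("-inf"))
        chosen = scores.topk(budget).indices
        assignment[chosen] = k
        taken[chosen] = True
    return assignment

# Usage: route 196 image tokens across three nested experts.
ffn = NestedFFN(d_model=256, d_hidden=1024, dims=[256, 512, 1024])
tokens = torch.randn(196, 256)
logits = torch.randn(196, 3)  # stand-in for a learned per-token router
assignment = route_tokens(logits, capacities=[128, 48, 20])
out = torch.empty_like(tokens)
for k in range(3):
    mask = assignment == k
    out[mask] = ffn.forward_expert(tokens[mask], k)
```

In this sketch the compute budget is set entirely by `capacities`, so a single trained model could be evaluated under different budgets simply by re-partitioning token counts across the nested experts, which mirrors the adaptability claim in the abstract.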