Atlas: Multi-Scale Attention Improves Long Context Image Modeling

March 16, 2025
Authors: Kumar Krishna Agrawal, Long Lian, Longchao Liu, Natalia Harguindeguy, Boyi Li, Alexander Bick, Maggie Chung, Trevor Darrell, Adam Yala
cs.AI

Abstract

Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas: (i) multi-scale representations and (ii) bi-directional cross-scale communication. MSA creates O(log N) scales to represent the image across progressively coarser features and leverages cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling on a high-resolution variant of ImageNet-100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNext-B (91.92%) while being 4.3x faster. Atlas is 2.95x faster and 7.38% more accurate than FasterViT, and 2.25x faster and 4.96% more accurate than LongViT. In comparisons against MambaVision-S, we find that Atlas-S achieves 5%, 16%, and 32% higher accuracy at 1024px, 2048px, and 4096px respectively, while obtaining similar runtimes. Code for reproducing our experiments and pretrained models is available at https://github.com/yalalab/atlas.
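
To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch of the general pattern: build O(log N) progressively coarser token grids by repeated pooling, then exchange information between adjacent scales with cross-attention in both directions. This is not the authors' implementation (see https://github.com/yalalab/atlas); the class name, the average-pooling choice, and the use of a single shared cross-attention module are illustrative assumptions.

```python
# Minimal sketch of the multi-scale attention idea, NOT the official Atlas code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleAttentionSketch(nn.Module):
    """Builds O(log N) progressively coarser token grids from a feature map and
    propagates information between adjacent scales with cross-attention."""

    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        # A single cross-attention module reused for all scale pairs (assumption).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def build_scales(self, x: torch.Tensor):
        """x: (B, C, H, W) feature map -> list of token sequences, finest first."""
        scales = []
        while True:
            b, c, h, w = x.shape
            scales.append(x.flatten(2).transpose(1, 2))  # (B, H*W, C) tokens
            if min(h, w) == 1:
                break
            x = F.avg_pool2d(x, kernel_size=2)  # coarsen spatial resolution by 2x
        return scales  # roughly log2(min(H, W)) + 1 scales

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scales = self.build_scales(x)
        # Bottom-up pass: each coarser scale attends to the finer scale below it.
        for i in range(1, len(scales)):
            q, kv = scales[i], scales[i - 1]
            scales[i] = self.norm(q + self.cross_attn(q, kv, kv)[0])
        # Top-down pass: each finer scale attends back to the coarser scale above it.
        for i in range(len(scales) - 2, -1, -1):
            q, kv = scales[i], scales[i + 1]
            scales[i] = self.norm(q + self.cross_attn(q, kv, kv)[0])
        return scales[0]  # updated finest-scale tokens, shape (B, H*W, C)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)       # (batch, channels, height, width)
    out = MultiScaleAttentionSketch(dim=64)(feats)
    print(out.shape)                          # torch.Size([2, 1024, 64])
```

Because every fine-scale token only cross-attends to much shorter coarse sequences (rather than attending to all N tokens), the cost grows far more slowly with image size than full self-attention, which is the tradeoff the abstract's runtime comparisons highlight.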
