Atlas: Multi-Scale Attention Improves Long Context Image Modeling
March 16, 2025
Authors: Kumar Krishna Agrawal, Long Lian, Longchao Liu, Natalia Harguindeguy, Boyi Li, Alexander Bick, Maggie Chung, Trevor Darrell, Adam Yala
cs.AI
Abstract
Efficiently modeling massive images is a long-standing challenge in machine
learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on
two key ideas: (i) multi-scale representations and (ii) bi-directional cross-scale
communication. MSA creates O(log N) scales to represent the image across
progressively coarser features and leverages cross-attention to propagate
information across scales. We then introduce Atlas, a novel neural network
architecture based on MSA. We demonstrate that Atlas significantly improves the
compute-performance tradeoff of long-context image modeling in a
high-resolution variant of ImageNet 100. At 1024px resolution, Atlas-B achieves
91.04% accuracy, comparable to ConvNext-B (91.92%) while being 4.3x faster.
Atlas is 2.95x faster and 7.38% better than FasterViT, and 2.25x faster and 4.96%
better than LongViT. In comparisons against MambaVision-S, we find that Atlas-S
achieves 5%, 16% and 32% higher accuracy at 1024px, 2048px and 4096px
respectively, while obtaining similar runtimes. Code for reproducing our
experiments and pretrained models is available at
https://github.com/yalalab/atlas.
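
For readers who want a concrete picture of the mechanism, the sketch below illustrates one plausible reading of the abstract in PyTorch: a pyramid of progressively coarser token grids (giving O(log N) scales) with bidirectional cross-attention between adjacent scales. This is a minimal illustration, not the authors' implementation; the module names, the 2x2 average-pooling choice, and the number of levels are all assumptions made here, and the official code lives at https://github.com/yalalab/atlas.

```python
# Hypothetical sketch of multi-scale attention as described in the abstract.
# Not the Atlas implementation; names and design details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossScaleBlock(nn.Module):
    """Bidirectional cross-attention between a fine and a coarse token set."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.fine_from_coarse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.coarse_from_fine = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor):
        # Coarse tokens aggregate information from the fine tokens ...
        coarse = coarse + self.coarse_from_fine(coarse, fine, fine)[0]
        # ... and fine tokens read the updated global context back.
        fine = fine + self.fine_from_coarse(fine, coarse, coarse)[0]
        return fine, coarse


class MultiScaleAttention(nn.Module):
    """Builds a token pyramid (halving each side per level, hence O(log N)
    scales) and propagates information between adjacent levels."""

    def __init__(self, dim: int, num_levels: int = 4, num_heads: int = 8):
        super().__init__()
        self.num_levels = num_levels
        self.blocks = nn.ModuleList(
            [CrossScaleBlock(dim, num_heads) for _ in range(num_levels - 1)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map of image tokens.
        B, C, H, W = x.shape
        # Progressively coarser features via 2x2 average pooling per level.
        pyramid = [x]
        for _ in range(self.num_levels - 1):
            pyramid.append(F.avg_pool2d(pyramid[-1], kernel_size=2))

        # Flatten each level to a token sequence of shape (B, tokens, C).
        tokens = [p.flatten(2).transpose(1, 2) for p in pyramid]

        # Bidirectional communication between adjacent scales, coarse to fine.
        for lvl in reversed(range(self.num_levels - 1)):
            tokens[lvl], tokens[lvl + 1] = self.blocks[lvl](tokens[lvl], tokens[lvl + 1])

        # Return the refined finest-scale tokens as a feature map again.
        return tokens[0].transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    msa = MultiScaleAttention(dim=64, num_levels=4)
    out = msa(torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

In this toy version the cost of full attention at the finest scale is avoided only to the extent that coarse levels carry the long-range context; the paper's compute-performance claims rest on its actual architecture, which this sketch does not reproduce.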