Atlas: Multi-Scale Attention Improves Long Context Image Modeling

Abstract

Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas: (i) multi-scale representations and (ii) bi-directional cross-scale communication. MSA creates O(log N) scales to represent the image across progressively coarser features and leverages cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling on a high-resolution variant of ImageNet 100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNext-B (91.92%), while being 4.3x faster. Atlas is 2.95x faster and 7.38% more accurate than FasterViT, and 2.25x faster and 4.96% more accurate than LongViT. Compared with MambaVision-S, Atlas-S achieves 5%, 16%, and 32% higher accuracy at 1024px, 2048px, and 4096px, respectively, with similar runtimes. Code for reproducing our experiments and pretrained models are available at https://github.com/yalalab/atlas.
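
To make the O(log N) claim concrete, the following is a minimal sketch of how the number of scales could grow with image size. It is not the authors' implementation: the function name `scale_token_counts`, the downsampling factor of 4, the coarsest-scale cutoff of 256 tokens, and the 16x16 patch size are all assumptions chosen for illustration, and N is taken here to be the number of image tokens. The sketch covers only the scale count, not the cross-attention between scales.

```python
import math


def scale_token_counts(num_tokens: int, downsample: int = 4, min_tokens: int = 256) -> list[int]:
    """Token counts for a pyramid of progressively coarser scales.

    `downsample` and `min_tokens` are illustrative assumptions, not values
    taken from the paper. Each new scale has `downsample` times fewer tokens
    than the previous one, so the number of scales grows logarithmically
    with `num_tokens`.
    """
    if num_tokens < 1 or downsample < 2 or min_tokens < 1:
        raise ValueError("expected num_tokens >= 1, downsample >= 2, min_tokens >= 1")
    counts = [num_tokens]
    # Keep coarsening until the coarsest scale is small enough.
    while counts[-1] > min_tokens:
        counts.append(math.ceil(counts[-1] / downsample))
    return counts


if __name__ == "__main__":
    for side in (1024, 2048, 4096):
        n = (side // 16) ** 2  # assumes 16x16 patches (hypothetical)
        counts = scale_token_counts(n)
        print(f"{side}px: N={n}, scales={len(counts)}, tokens per scale={counts}")
```

Under these assumptions, doubling the image side length quadruples N but adds only one extra scale, which is the logarithmic growth the abstract refers to.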

