Atlas: Multi-Scale Attention Improves Long Context Image Modeling

Abstract

Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas: (i) multi-scale representations and (ii) bi-directional cross-scale communication. MSA creates O(log N) scales to represent the image with progressively coarser features and uses cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling on a high-resolution variant of ImageNet 100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNext-B (91.92%), while being 4.3x faster. Atlas is 2.95x faster and 7.38% better than FasterViT, and 2.25x faster and 4.96% better than LongViT. Compared with MambaVision-S, Atlas-S achieves 5%, 16%, and 32% higher accuracy at 1024px, 2048px, and 4096px, respectively, with similar runtimes. Code for reproducing our experiments and pretrained models is available at https://github.com/yalalab/atlas.
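To illustrate why the number of scales grows as O(log N), here is a minimal sketch. It is not the authors' implementation. It assumes each coarser scale pools a fixed number of tokens into one, and that the image is split into 16px patches. The function name `scale_token_counts`, the pooling factor of 4, and the patch size are all hypothetical choices for illustration; the abstract specifies none of them.

```python
def scale_token_counts(num_tokens: int, pool_factor: int = 4, min_tokens: int = 1) -> list[int]:
    """Return the token count at each scale, from finest to coarsest.

    Assumption (not from the paper): every coarser scale merges
    `pool_factor` tokens into one, until fewer than `min_tokens` would remain.
    """
    if num_tokens < 1 or pool_factor < 2 or min_tokens < 1:
        raise ValueError("need num_tokens >= 1, pool_factor >= 2, min_tokens >= 1")
    counts = [num_tokens]
    # Each step divides the token count by pool_factor, so the loop runs
    # about log_{pool_factor}(num_tokens) times: O(log N) scales in total.
    while counts[-1] // pool_factor >= min_tokens:
        counts.append(counts[-1] // pool_factor)
    return counts


# Hypothetical 16px patches: a 1024px image gives 64 * 64 = 4096 tokens,
# and a 4096px image gives 256 * 256 = 65536 tokens.
tokens_1024 = (1024 // 16) ** 2
tokens_4096 = (4096 // 16) ** 2
print(scale_token_counts(tokens_1024))
print(len(scale_token_counts(tokens_4096)))
```

Under these assumptions, a 16-fold increase in token count (1024px to 4096px) adds only two scales, which is the logarithmic growth the abstract refers to. The cross-attention step that passes information between scales is not shown.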
