LoopViT: Scaling Visual ARC with Looped Transformers

Abstract

Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state "crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M-parameter model achieves 65.8% accuracy, outperforming massive 73M-parameter ensembles. These findings demonstrate that adaptive iterative computation offers a far more efficient scaling axis for visual reasoning than simply increasing network width. The code is available at https://github.com/WenjieShu/LoopViT.
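The two mechanisms named in the abstract, weight-tied recurrence and an entropy-based exit, can be illustrated with a minimal toy sketch. It is not the authors' implementation: the abstract does not specify the exit threshold, the step budget, or the internals of the Hybrid Block, so `entropy_threshold`, `max_steps`, and `toy_block` below are hypothetical. For simplicity the looped state is taken to be the per-cell class logits themselves, whereas a real model would presumably loop over hidden tokens and decode them to logits.

```python
import math
from typing import Callable, List, Tuple

Logits = List[List[float]]  # one row of class logits per grid cell


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_entropy(logits: Logits) -> float:
    """Average Shannon entropy (in nats) of the per-cell predictions."""
    total = 0.0
    for row in logits:
        probs = softmax(row)
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(logits)


def looped_inference(
    state: Logits,
    block: Callable[[Logits], Logits],
    max_steps: int = 16,              # assumed step budget
    entropy_threshold: float = 0.05,  # assumed exit threshold
) -> Tuple[Logits, int]:
    """Apply the same block repeatedly; stop once predictions are confident."""
    for step in range(1, max_steps + 1):
        state = block(state)  # weight-tied: the identical function every step
        if mean_entropy(state) < entropy_threshold:
            return state, step  # dynamic exit, no extra learned parameters
    return state, max_steps


def toy_block(state: Logits) -> Logits:
    """Hypothetical stand-in for the Hybrid Block: sharpens every logit row."""
    return [[1.5 * v for v in row] for row in state]


if __name__ == "__main__":
    start = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
    final, steps = looped_inference(start, toy_block)
    print(f"exited after {steps} steps, mean entropy {mean_entropy(final):.4f}")
```

Because the exit test only reads the entropy of the current predictions, it adds no trainable parameters, and the number of loop iterations can vary per input while the parameter count stays fixed.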
