LoopViT: Scaling Visual ARC with Looped Transformers
February 2, 2026
Authors: Wen-Jie Shu, Xuerui Qiu, Rui-Jie Zhu, Harold Haodong Chen, Yexin Liu, Harry Yang
cs.AI
Abstract
Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, in which computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions with global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state "crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M-parameter model achieves 65.8% accuracy, outperforming a 73M-parameter ensemble. These findings demonstrate that adaptive iterative computation offers a far more efficient scaling axis for visual reasoning than simply increasing network width. The code is available at https://github.com/WenjieShu/LoopViT.
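The core loop described above can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: the toy `block` and `head` callables stand in for the weight-tied Hybrid Block and the prediction head, and the entropy threshold is a hypothetical value. It shows the two mechanisms the abstract names: reusing one set of weights across iterations, and halting when predictive entropy drops below a threshold.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the class axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predictive_entropy(logits):
    # Mean Shannon entropy (in nats) of the predictive distribution
    # over all positions; low entropy = a "crystallized" prediction.
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

def looped_inference(x, block, head, max_loops=16, entropy_threshold=0.1):
    """Iterate a single weight-tied block and exit dynamically.

    The same `block` (same weights) is applied at every step, so
    reasoning depth is controlled by the loop count, not by adding
    parameters. Inference halts once predictive entropy falls below
    `entropy_threshold`, or after `max_loops` iterations.
    """
    state = x
    for step in range(1, max_loops + 1):
        state = block(state)        # weight-tied: identical weights each step
        logits = head(state)
        if predictive_entropy(logits) < entropy_threshold:
            break                   # low-uncertainty attractor reached
    return logits, step

# Toy demo: a "block" that sharpens the state each iteration, so the
# predictive distribution becomes more peaked until the exit fires.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 10))        # 4 positions, 10 classes
    logits, step = looped_inference(x, block=lambda s: 1.5 * s,
                                    head=lambda s: s)
    print(step, predictive_entropy(logits))
```

Because the exit test is computed from the logits alone, this mechanism adds no learned parameters; the adaptive depth comes entirely from how many times the loop runs before the entropy criterion is met.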