LoopViT: Scaling Visual ARC with Looped Transformers
February 2, 2026
Authors: Wen-Jie Shu, Xuerui Qiu, Rui-Jie Zhu, Harold Haodong Chen, Yexin Liu, Harry Yang
cs.AI
Abstract
Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, whose computational depth is strictly bound to parameter count, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called LoopViT, which decouples reasoning depth from model capacity through weight-tied recurrence. LoopViT iterates a weight-tied Hybrid Block, combining local convolutions with global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference once its internal state "crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M-parameter model achieves 65.8% accuracy, outperforming 73M-parameter ensembles. These findings demonstrate that adaptive iterative computation offers a far more efficient scaling axis for visual reasoning than simply increasing network width. The code is available at https://github.com/WenjieShu/LoopViT.
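To make the two mechanisms concrete, below is a minimal PyTorch-style sketch of a weight-tied looped block with an entropy-based dynamic exit. It is an illustration under assumptions, not the released implementation: class and argument names such as HybridBlock, LoopedReasoner, max_loops, and entropy_threshold are hypothetical.

```python
# Minimal sketch (assumed PyTorch setting) of the two ideas in the abstract:
# (1) one weight-tied block applied repeatedly, and (2) an entropy-based
# dynamic exit that halts once the prediction is confident.
# Names (HybridBlock, LoopedReasoner, max_loops, entropy_threshold) are
# illustrative assumptions, not the authors' actual API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridBlock(nn.Module):
    """One shared block: local (depthwise) convolution, then global self-attention."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, height, width)
        b, c, h, w = x.shape
        x = x + self.conv(x)                                # local mixing
        tokens = self.norm1(x.flatten(2).transpose(1, 2))   # (batch, h*w, dim)
        attn_out, _ = self.attn(tokens, tokens, tokens)     # global mixing
        tokens = self.norm2(tokens + attn_out)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class LoopedReasoner(nn.Module):
    """Iterates one weight-tied block; exits early when predictive entropy is low."""

    def __init__(self, dim: int, num_classes: int,
                 max_loops: int = 16, entropy_threshold: float = 0.1):
        super().__init__()
        self.block = HybridBlock(dim)                 # same weights reused every loop
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)
        self.max_loops = max_loops
        self.entropy_threshold = entropy_threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.head(x)
        for _ in range(self.max_loops):
            x = self.block(x)                         # one latent reasoning step
            logits = self.head(x)
            probs = F.softmax(logits, dim=1)
            # Mean per-cell predictive entropy; low entropy = "crystallized" state.
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=1).mean()
            if entropy.item() < self.entropy_threshold:
                break                                 # parameter-free dynamic exit
        return logits
```

In this sketch the number of reasoning steps is bounded by max_loops but in practice set by the entropy threshold, which is what separates inference-time depth from parameter count.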