
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

January 5, 2026
Authors: Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang
cs.AI

Abstract

The grand vision of persistent, large-scale 3D visual geometry understanding has long been held back by the conflicting demands of scalability and long-term stability. While offline models like VGGT achieve impressive geometric capability, their batch-based nature makes them unsuitable for live systems. Streaming architectures, though intended for live operation, have proven inadequate: existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We resolve this dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Building on this, we devise a training-free, attention-agnostic pruning strategy that discards obsolete information, effectively "rolling" the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT alleviates this trade-off, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to validate rigorously because extremely long, continuous benchmarks are lacking. To address this gap, we introduce the Long3D benchmark, which, for the first time, enables rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames. It provides an evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT
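To make the rolling-memory idea concrete, here is a minimal sketch in plain Python of a KV cache with a fixed budget that evicts entries using only the cached keys, so no attention weights are ever needed. The abstract does not state the pruning criterion. The redundancy score used here (cosine similarity to the mean key), the `RollingKVCache` class, and the `budget` and `keep_first` parameters are all illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class RollingKVCache:
    """Bounded key/value cache that evicts one entry whenever it overflows.

    Assumption (not from the paper): the entry whose key is most similar
    to the mean key is treated as the most redundant and is evicted.
    The score depends on keys only, never on attention weights.
    """

    def __init__(self, budget, keep_first=1):
        # keep_first is a hypothetical knob: the oldest entries are never
        # evicted, e.g. to preserve a reference frame.
        if budget <= keep_first:
            raise ValueError("budget must be larger than keep_first")
        self.budget = budget
        self.keep_first = keep_first
        self.keys = []
        self.values = []

    def append(self, key, value):
        """Add one token's key/value, then prune back down to the budget."""
        self.keys.append(list(key))
        self.values.append(list(value))
        while len(self.keys) > self.budget:
            self._evict_one()

    def _evict_one(self):
        n = len(self.keys)
        dim = len(self.keys[0])
        mean_key = [sum(k[d] for k in self.keys) / n for d in range(dim)]
        # Protected entries: the first keep_first and the newest one.
        candidates = range(self.keep_first, n - 1)
        victim = max(candidates, key=lambda i: cosine(self.keys[i], mean_key))
        del self.keys[victim]
        del self.values[victim]


if __name__ == "__main__":
    cache = RollingKVCache(budget=4, keep_first=1)
    for i in range(100):
        cache.append([math.cos(0.1 * i), math.sin(0.1 * i)], [float(i)])
    print(len(cache.keys), cache.values[0], cache.values[-1])
```

However many entries are appended, the cache never holds more than `budget` of them, which is the property that permits unbounded stream length. Because the eviction score is computed from keys alone, a scheme of this kind does not need the attention matrix to be materialized, which is why such pruning can coexist with fused attention kernels.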