
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

January 5, 2026
Authors: Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang
cs.AI

Abstract

The grand vision of persistent, large-scale 3D visual geometry understanding is held back by the competing demands of scalability and long-term stability. Offline models such as VGGT achieve strong geometry capability, but their batch-based nature makes them unsuitable for live systems. Streaming architectures, though intended for live operation, have proven inadequate: existing methods either fail to support truly infinite-horizon inputs or suffer catastrophic drift over long sequences. We address this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Building on this, we devise a training-free, attention-agnostic pruning strategy that discards obsolete information, effectively "rolling" the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT alleviates this trade-off, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to validate rigorously because extremely long-term, continuous benchmarks have been lacking. To address this gap, we introduce the Long3D benchmark, which for the first time enables rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames, providing an evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT
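
To make the idea of a bounded, rolling KV cache concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The class name RollingKVCache, the method append_frame, and in particular the eviction score (cosine similarity of each key to the mean key, used as a stand-in for "redundancy") are assumptions made for illustration; the abstract does not specify the actual pruning criterion. The only properties taken from the abstract are that the cache stays bounded, that pruning needs no training, and that it does not read attention weights.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class RollingKVCache:
    """Hypothetical bounded KV cache that prunes after every appended frame."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.keys = []    # one key vector per cached token
        self.values = []  # one value per cached token, aligned with self.keys

    def __len__(self):
        return len(self.keys)

    def append_frame(self, keys, values):
        """Add the tokens of a new frame, then prune back to the capacity."""
        if len(keys) != len(values):
            raise ValueError("keys and values must have the same length")
        self.keys.extend(keys)
        self.values.extend(values)
        self._prune()

    def _prune(self):
        excess = len(self.keys) - self.capacity
        if excess <= 0:
            return
        n = len(self.keys)
        dim = len(self.keys[0])
        mean = [sum(k[d] for k in self.keys) / n for d in range(dim)]
        # Assumed score: keys closest to the mean key count as most redundant.
        # Only the keys are inspected, never attention weights.
        scores = [cosine(k, mean) for k in self.keys]
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        drop = set(ranked[:excess])
        self.keys = [k for i, k in enumerate(self.keys) if i not in drop]
        self.values = [v for i, v in enumerate(self.values) if i not in drop]


# Usage: the cache size stays bounded no matter how many frames arrive.
cache = RollingKVCache(capacity=4)
for frame in range(10):
    frame_keys = [[1.0, float(frame + t)] for t in range(3)]
    frame_values = [f"frame{frame}-token{t}" for t in range(3)]
    cache.append_frame(frame_keys, frame_values)
    assert len(cache) <= 4
```

Because the score in this sketch depends only on the cached keys, the attention computation itself can remain a fused kernel that never exposes its weight matrix, which is one plausible reading of why an attention-agnostic rule fits alongside FlashAttention.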