Streaming 4D Visual Geometry Transformer
July 15, 2025
Authors: Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu
cs.AI
Abstract
Perceiving and reconstructing 4D spatial-temporal geometry from videos is a
fundamental yet challenging computer vision task. To facilitate interactive and
real-time applications, we propose a streaming 4D visual geometry transformer
that shares a similar philosophy with autoregressive large language models. We
explore a simple and efficient design and employ a causal transformer
architecture to process the input sequence in an online manner. We use temporal
causal attention and cache the historical keys and values as implicit memory to
enable efficient streaming long-term 4D reconstruction. This design can handle
real-time 4D reconstruction by incrementally integrating historical information
while maintaining high-quality spatial consistency. For efficient training, we
propose to distill knowledge from the dense bidirectional visual geometry
grounded transformer (VGGT) to our causal model. For inference, our model
supports the migration of optimized efficient attention operators (e.g.,
FlashAttention) from the field of large language models. Extensive experiments
on various 4D geometry perception benchmarks demonstrate that our model
increases the inference speed in online scenarios while maintaining competitive
performance, paving the way for scalable and interactive 4D vision systems.
Code is available at: https://github.com/wzzheng/StreamVGGT.
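The streaming mechanism described above — temporal causal attention with historical keys and values cached as implicit memory — can be sketched as follows. This is a minimal NumPy illustration of the KV-caching idea only; the shapes, names, and single-head formulation are hypothetical and do not come from the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CausalKVCache:
    """Accumulates the keys/values of past frames as implicit memory."""

    def __init__(self):
        self.keys = []    # one (tokens, dim) array per processed frame
        self.values = []

    def step(self, q, k, v):
        """Process one incoming frame: append its K/V to the cache, then
        let its queries attend over all frames seen so far. Causality is
        enforced implicitly, since the cache only ever contains the past
        and the current frame."""
        self.keys.append(k)
        self.values.append(v)
        K = np.concatenate(self.keys, axis=0)    # (total_tokens, dim)
        V = np.concatenate(self.values, axis=0)
        scores = q @ K.T / np.sqrt(q.shape[-1])  # scaled dot-product attention
        return softmax(scores, axis=-1) @ V
```

Each `step` call only computes attention for the new frame against the cached tokens, so past frames are never re-encoded — this is what makes incremental, online 4D reconstruction cheap relative to re-running a bidirectional model on the whole sequence.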
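The training recipe — distilling the dense bidirectional VGGT teacher into the causal student — can likewise be sketched with stand-in models. Everything here is a hypothetical toy: the "teacher" and "student" are placeholder functions (a full-sequence mean vs. a prefix-only running mean) that only mimic the bidirectional-vs-causal dependency structure, and the plain L2 objective is an assumption, not the paper's exact loss.

```python
import numpy as np

def teacher_forward(frames):
    # stand-in for the bidirectional teacher: every frame's output may
    # depend on ALL frames (here: the mean over the full sequence)
    return frames + frames.mean(axis=0, keepdims=True)

def student_forward(frames):
    # stand-in for the causal student: frame t's output may depend only
    # on frames 0..t (here: a running mean over the prefix)
    counts = np.arange(1, len(frames) + 1)[:, None]
    prefix_mean = np.cumsum(frames, axis=0) / counts
    return frames + prefix_mean

def distillation_loss(frames):
    # the student is regressed onto the teacher's dense predictions,
    # transferring information from the full sequence into a model that
    # only ever sees the past
    t = teacher_forward(frames)
    s = student_forward(frames)
    return float(np.mean((s - t) ** 2))
```

Note that for a single frame the prefix and the full sequence coincide, so the loss is zero there by construction; for longer sequences the student is pushed to approximate, from the past alone, what the teacher computes with full bidirectional context.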