
Streaming 4D Visual Geometry Transformer

July 15, 2025
Authors: Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu
cs.AI

Abstract

Perceiving and reconstructing 4D spatial-temporal geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and real-time applications, we propose a streaming 4D visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 4D reconstruction. This design can handle real-time 4D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized efficient attention operators (e.g., FlashAttention) from the field of large language models. Extensive experiments on various 4D geometry perception benchmarks demonstrate that our model increases the inference speed in online scenarios while maintaining competitive performance, paving the way for scalable and interactive 4D vision systems. Code is available at: https://github.com/wzzheng/StreamVGGT.
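The core mechanism the abstract describes is temporal causal attention with cached keys and values: each incoming frame's tokens attend to the KV cache of all previously processed frames plus their own, so earlier frames are never re-encoded. The following is a minimal NumPy sketch of that idea, not the paper's actual implementation; the class and method names (`StreamingCausalAttention`, `step`) and the single-head, unprojected Q/K/V shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class StreamingCausalAttention:
    """Illustrative sketch of temporal causal attention with a KV cache.

    Each call to step() processes one frame's tokens online: the frame's
    keys/values are appended to the cache (the "implicit memory"), and its
    queries attend over everything cached so far. Past frames are never
    revisited, which is what makes the streaming setting efficient.
    """

    def __init__(self, dim):
        self.dim = dim
        self.k_cache = np.empty((0, dim))
        self.v_cache = np.empty((0, dim))

    def step(self, q, k, v):
        # Append the new frame's keys/values to the implicit memory.
        self.k_cache = np.concatenate([self.k_cache, k], axis=0)
        self.v_cache = np.concatenate([self.v_cache, v], axis=0)
        # Temporal causality: queries see only frames up to the current one.
        scores = q @ self.k_cache.T / np.sqrt(self.dim)
        return softmax(scores, axis=-1) @ self.v_cache
```

Because the cache already contains every earlier frame, the streaming output for frame t is identical to offline attention with a frame-level causal mask; a fused kernel such as FlashAttention can replace the naive `softmax(QK^T)V` above without changing this interface.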