ChatPaper.ai


FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

December 1, 2025
Authors: Zipeng Wang, Dan Xu
cs.AI

Abstract

3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead. Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% of VGGT's for 1,000 images, and scaling efficiently to sequences exceeding 3,000 images. Our project page is available at https://wzpscott.github.io/flashvggt_page/.
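The core idea of the abstract, replacing dense self-attention over all F·N image tokens with cross-attention against F·K compressed descriptors (K ≪ N), can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: the mean-pooling compressor stands in for whatever learned compression module FlashVGGT uses, and single-head attention without learned Q/K/V projections is used for brevity. The point it shows is the complexity reduction from O((F·N)²) to O(F·N · F·K).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_frame(tokens, k):
    # Stand-in for the learned per-frame compressor (hypothetical):
    # split a frame's N tokens into k groups and mean-pool each,
    # yielding k descriptor tokens of shape (k, d).
    groups = np.array_split(tokens, k, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])

def descriptor_attention(frames, k=8):
    # frames: (F, N, d) image tokens for F frames.
    # 1) Compress each frame to k descriptors -> (F*k, d).
    # 2) Cross-attend the full F*N token set to the F*k descriptors,
    #    so the attention matrix is (F*N, F*k) instead of (F*N, F*N).
    F, N, d = frames.shape
    desc = np.concatenate([compress_frame(f, k) for f in frames], axis=0)
    q = frames.reshape(F * N, d)
    attn = softmax(q @ desc.T / np.sqrt(d))  # (F*N, F*k)
    return (attn @ desc).reshape(F, N, d)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 196, 64))  # 16 frames, 196 tokens each
out = descriptor_attention(x, k=8)
print(out.shape)
```

The chunk-recursive inference described in the abstract would follow naturally from this layout: because only the (F·k, d) descriptor matrix is needed as the key/value set, descriptors from already-processed chunks can be cached and concatenated when attending from a new chunk's tokens, rather than re-encoding all past frames.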
December 4, 2025