

FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

December 1, 2025
Authors: Zipeng Wang, Dan Xu
cs.AI

Abstract

3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounded Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses the spatial information of each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead. Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% of VGGT's for 1,000 images, and scaling efficiently to sequences exceeding 3,000 images. Our project page is available at https://wzpscott.github.io/flashvggt_page/.
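The descriptor-based attention and chunk-recursive inference described above can be sketched as follows. This is a minimal illustration, not FlashVGGT's implementation: the paper's compression module is learned, whereas here a hypothetical average-pooling stand-in compresses each frame's N tokens to k descriptors, and the attention is single-head without learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_to_descriptors(frame_tokens, k):
    """Compress each frame's tokens (F, N, D) -> (F, k, D).

    Hypothetical stand-in: average-pool groups of N//k tokens.
    Requires N divisible by k. FlashVGGT learns this compression.
    """
    f, n, d = frame_tokens.shape
    return frame_tokens.reshape(f, k, n // k, d).mean(axis=2)

def descriptor_attention(tokens, k=8):
    """Cross-attention from all F*N image tokens to the F*k descriptors.

    Cost scales as O(F*N * F*k) instead of O((F*N)^2) for dense
    global self-attention, since k << N.
    """
    f, n, d = tokens.shape
    desc = compress_to_descriptors(tokens, k).reshape(f * k, d)
    q = tokens.reshape(f * n, d)                 # queries: every image token
    attn = softmax(q @ desc.T / np.sqrt(d))      # (F*N, F*k)
    return (attn @ desc).reshape(f, n, d)

def chunk_recursive_inference(chunks, k=8):
    """Online inference over a long sequence, one chunk at a time.

    Each chunk's tokens attend to descriptors cached from all previous
    chunks plus the current chunk, so memory grows only with the small
    descriptor set rather than with all past image tokens.
    """
    cache, outputs = [], []
    for tokens in chunks:
        f, n, d = tokens.shape
        cache.append(compress_to_descriptors(tokens, k).reshape(f * k, d))
        all_desc = np.concatenate(cache, axis=0)
        q = tokens.reshape(f * n, d)
        attn = softmax(q @ all_desc.T / np.sqrt(d))
        outputs.append((attn @ all_desc).reshape(f, n, d))
    return outputs

# Usage: 4 frames of 64 tokens each, feature dim 32, k=8 descriptors/frame.
tokens = np.random.default_rng(0).normal(size=(4, 64, 32))
out = descriptor_attention(tokens, k=8)          # shape (4, 64, 32)
streamed = chunk_recursive_inference([tokens, tokens], k=8)
```

With these toy sizes the attention matrix shrinks from 256x256 to 256x32 per chunk, which is the source of the reported speedup at scale.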