Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

May 22, 2026
Authors: Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski
cs.AI

Abstract

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, because of the global attention layers inside these models, their computational cost grows quadratically with the input sequence length, which limits both scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify the frames that should be preserved. Second, an intra-frame selection step discards additional redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which suggests that our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good-token-hunting.github.io.
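The two-stage idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact selection rules, so every name below (`select_diverse_frames`, `attention_entropy`, `layer_keep_ratio`, `top_keys`) is hypothetical. Greedy farthest-point sampling over per-frame descriptors stands in for the diversity-based inter-frame step, and the rule mapping attention entropy to a per-layer keep ratio is an assumption chosen only to show how an entropy signal could set a layer-specific key/value budget.

```python
import math


def select_diverse_frames(frame_feats, k):
    """Inter-frame step (assumed stand-in): greedy farthest-point sampling.

    frame_feats is a list of equal-length per-frame descriptor vectors.
    Returns the sorted indices of k frames that are spread out in feature space.
    """
    if not 0 < k <= len(frame_feats):
        raise ValueError("k must be between 1 and the number of frames")
    chosen = [0]  # arbitrary seed: start from the first frame
    # Distance from each frame to its nearest already-chosen frame.
    nearest = [math.dist(f, frame_feats[0]) for f in frame_feats]
    while len(chosen) < k:
        nxt = max(range(len(frame_feats)), key=lambda i: nearest[i])
        chosen.append(nxt)
        for i, f in enumerate(frame_feats):
            nearest[i] = min(nearest[i], math.dist(f, frame_feats[nxt]))
    return sorted(chosen)


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row that sums to 1."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def layer_keep_ratio(weights, min_ratio=0.1):
    """Hypothetical layer-aware rule: keep a fraction of keys equal to the
    normalized entropy, so peaked (low-entropy) layers are pruned harder."""
    n = len(weights)
    if n < 2:
        return 1.0
    normalized = attention_entropy(weights) / math.log(n)
    return max(min_ratio, min(1.0, normalized))


def top_keys(weights, ratio):
    """Intra-frame step (assumed): indices of the highest-weight keys kept."""
    k = max(1, math.ceil(ratio * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(order[:k])
```

In this sketch, a query would attend only to the keys returned by `top_keys` within the frames returned by `select_diverse_frames`, so the cost per query depends on the retained budget rather than on the full sequence length. A real system would operate on batched tensors and would need a cheap way to estimate the attention pattern before computing full attention; both are outside the scope of this illustration.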