Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
May 22, 2026
Authors: Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski
cs.AI
Abstract
Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length because of the global attention layers inside these models, which limits both their scalability and their efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify the frames that should be preserved. Second, an intra-frame selection step discards the remaining redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which suggests that our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good-token-hunting.github.io.
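
To make the two-stage idea concrete, the following is a minimal NumPy sketch of key/value token selection; it is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the use of greedy farthest-point sampling as the diversity criterion, the mean attention received as a token score, and the rule that maps higher attention entropy to a larger keep ratio. The abstract only states that frame selection is diversity-based and that per-layer sparsification is guided by attention entropy. The sketch also scores tokens from a dense attention matrix purely for readability, which a practical method would avoid.

```python
import numpy as np


def select_diverse_frames(frame_feats, k):
    """Inter-frame step (assumed): greedy farthest-point sampling over
    per-frame descriptors, so the kept frames spread across the scene."""
    n = frame_feats.shape[0]
    k = min(k, n)
    selected = [0]  # arbitrary seed frame
    # Distance from every frame to its nearest already-selected frame.
    min_dist = np.linalg.norm(frame_feats - frame_feats[0], axis=1)
    min_dist[0] = -np.inf  # never pick the same frame twice
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        d = np.linalg.norm(frame_feats - frame_feats[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
        min_dist[nxt] = -np.inf
    return sorted(selected)


def attention_entropy(attn):
    """Mean Shannon entropy of the rows of a (queries, keys) attention
    matrix whose rows sum to 1."""
    p = np.clip(attn, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())


def keep_ratio_from_entropy(entropy, num_keys, min_ratio=0.1):
    """Layer-aware ratio (assumed rule): diffuse, high-entropy attention
    keeps more tokens; peaked, low-entropy attention keeps fewer."""
    normalized = entropy / np.log(num_keys)
    return float(np.clip(normalized, min_ratio, 1.0))


def select_tokens(attn, frame_ids, kept_frames, keep_ratio):
    """Intra-frame step (assumed): inside the kept frames, retain the
    key tokens that receive the most attention on average."""
    scores = attn.mean(axis=0)  # one score per key token
    candidates = np.flatnonzero(np.isin(frame_ids, kept_frames))
    n_keep = max(1, int(np.ceil(keep_ratio * candidates.size)))
    order = np.argsort(scores[candidates])[::-1][:n_keep]
    return np.sort(candidates[order])
```

In this sketch, each global attention layer would compute its own entropy and keep ratio, and then attend only to the key/value tokens returned by `select_tokens`. Restricting every query to m selected tokens instead of all n reduces the attention cost of that layer from roughly n x n to n x m query-key interactions.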