Quantized Visual Geometry Grounded Transformer
September 25, 2025
Authors: Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
cs.AI
Abstract
Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress through the use of large-scale transformers. However, their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes QuantVGGT, the first quantization framework for VGGTs, which rests on two technical contributions. First, we introduce Dual-Smoothed Fine-Grained Quantization, which combines a pre-global Hadamard rotation with post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. Notably, our 4-bit QuantVGGT delivers a 3.7× memory reduction and 2.5× acceleration in real-hardware inference while maintaining reconstruction accuracy above 98% of its full-precision counterpart, demonstrating the clear advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
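
To make the first technical contribution concrete, below is a minimal PyTorch sketch of the general idea behind Dual-Smoothed Fine-Grained Quantization as the abstract describes it: a global Hadamard rotation flattens heavy-tailed activation outliers, a SmoothQuant-style per-channel rescaling evens out inter-channel variance, and per-group uniform quantization follows. The function names, group size, smoothing exponent, and toy data are illustrative assumptions, not the released implementation.

```python
# Minimal sketch (not the authors' code) of the Dual-Smoothed Fine-Grained
# Quantization idea: pre-global Hadamard rotation + post-local channel
# smoothing + per-group uniform quantization.
import torch

def hadamard(n: int) -> torch.Tensor:
    """Normalized Hadamard matrix via Sylvester construction (n must be a power of 2)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)

def channel_smooth(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    """Shift quantization difficulty from activations to weights, per input channel."""
    act_scale = x.abs().amax(dim=0).clamp(min=1e-5)
    w_scale = w.abs().amax(dim=0).clamp(min=1e-5)
    s = (act_scale ** alpha) / (w_scale ** (1 - alpha))
    return x / s, w * s                              # (x/s) @ (w*s).T == x @ w.T

def quantize_per_group(x: torch.Tensor, bits: int = 4, group: int = 64):
    """Symmetric uniform fake-quantization with one scale per contiguous group."""
    qmax = 2 ** (bits - 1) - 1
    xg = x.reshape(-1, group)
    scale = xg.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / qmax
    xq = torch.clamp(torch.round(xg / scale), -qmax - 1, qmax) * scale
    return xq.reshape_as(x)

# Toy example: activations (tokens x channels) and a linear weight (out x in).
torch.manual_seed(0)
x = torch.randn(128, 256); x[:, 3] *= 40.0           # inject a heavy-tailed channel
w = torch.randn(512, 256)
R = hadamard(256)
x_rot, w_rot = x @ R, w @ R                          # pre-global rotation (R is orthogonal)
x_s, w_s = channel_smooth(x_rot, w_rot)              # post-local channel smoothing
y_q = quantize_per_group(x_s, bits=4) @ quantize_per_group(w_s, bits=4).T
print((y_q - x @ w.T).abs().mean())                  # quantization error vs. full precision
```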
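For the second contribution, here is a hedged sketch of the Noise-Filtered Diverse Sampling idea, assuming per-candidate statistics from a deep transformer layer are already collected: z-score filtering drops outlier candidates, and greedy farthest-point selection over the remaining frame features yields a diverse calibration set. The helper name, thresholds, and the stand-in `deep_feats` tensor are hypothetical, not the paper's procedure in detail.

```python
# Minimal sketch (assumed, not the paper's code) of Noise-Filtered Diverse
# Sampling: filter outlier candidates by deep-layer statistics, then pick a
# diverse calibration subset via greedy farthest-point selection.
import torch

def noise_filtered_diverse_sampling(deep_feats: torch.Tensor,
                                    num_calib: int = 8,
                                    z_thresh: float = 2.5) -> list[int]:
    """deep_feats: (num_candidates, feat_dim) per-sample deep-layer statistics."""
    # 1) Noise filtering: drop candidates whose feature norm is a z-score outlier.
    norms = deep_feats.norm(dim=1)
    z = (norms - norms.mean()) / norms.std().clamp(min=1e-6)
    keep = torch.nonzero(z.abs() < z_thresh).flatten().tolist()

    # 2) Diversity: greedy farthest-point sampling in feature space, so the
    #    calibration set covers distinct viewpoints / content clusters.
    feats = torch.nn.functional.normalize(deep_feats[keep], dim=1)
    chosen = [0]                                     # start from the first kept sample
    dists = 1 - feats @ feats[0]                     # cosine distance to the chosen set
    while len(chosen) < min(num_calib, len(keep)):
        nxt = int(torch.argmax(dists))
        chosen.append(nxt)
        dists = torch.minimum(dists, 1 - feats @ feats[nxt])
    return [keep[i] for i in chosen]                 # map back to original candidate ids

# Toy usage: 100 candidate frames with 64-d deep statistics, a few outliers.
torch.manual_seed(0)
feats = torch.randn(100, 64); feats[::25] *= 10.0    # inject outlier candidates
print(noise_filtered_diverse_sampling(feats, num_calib=8))
```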