Quantized Visual Geometry Grounded Transformer
September 25, 2025
Authors: Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
cs.AI
Abstract
Learning-based 3D reconstruction models, represented by Visual Geometry
Grounded Transformers (VGGTs), have made remarkable progress with the use of
large-scale transformers. However, their prohibitive computational and memory
costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has
become a common practice for compressing and accelerating models. However, we
empirically observe that PTQ faces unique obstacles when compressing
billion-scale VGGTs: the data-independent special tokens induce heavy-tailed
activation distributions, while the multi-view nature of 3D data makes
calibration sample selection highly unstable. This paper proposes the first
quantization framework for VGGTs, namely QuantVGGT, which mainly relies on two
technical contributions: First, we introduce Dual-Smoothed Fine-Grained
Quantization, which integrates pre-global Hadamard rotation and post-local
channel smoothing to mitigate heavy-tailed distributions and inter-channel
variance robustly. Second, we design Noise-Filtered Diverse Sampling, which
filters outliers via deep-layer statistics and constructs frame-aware diverse
calibration clusters to ensure stable quantization ranges. Comprehensive
experiments demonstrate that QuantVGGT achieves state-of-the-art results
across different benchmarks and bit-widths, surpassing the previous
state-of-the-art generic quantization method by a large margin. We highlight
that our 4-bit QuantVGGT can deliver a 3.7× memory reduction and
2.5× acceleration in real-hardware inference, while maintaining
reconstruction accuracy above 98% of its full-precision counterpart. This
demonstrates the vast advantages and practicality of QuantVGGT in
resource-constrained scenarios. Our code is released at
https://github.com/wlfeng0509/QuantVGGT.
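
To make the first contribution more concrete, below is a minimal sketch (not the authors' released implementation; the function names, the smoothing exponent, and the group size are illustrative assumptions) of how a pre-GEMM Hadamard rotation and per-channel smoothing can be combined before fine-grained low-bit quantization:

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    # Sylvester construction; assumes n is a power of two.
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)  # orthonormal, so it can be folded into the weights

def quantize_groupwise(x: torch.Tensor, bits: int = 4, group: int = 128) -> torch.Tensor:
    # Symmetric fake quantization with one scale per contiguous group of values.
    qmax = 2 ** (bits - 1) - 1
    xg = x.reshape(-1, group)  # assumes x.numel() is divisible by `group`
    scale = xg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (torch.round(xg / scale).clamp(-qmax - 1, qmax) * scale).reshape(x.shape)

def dual_smoothed_quant(act: torch.Tensor, weight: torch.Tensor, bits: int = 4):
    # act: (tokens, C); weight: (C_out, C). Both the rotation and the smoothing factors
    # are folded into the weight, so the smoothed GEMM equals the original one
    # before quantization noise is introduced.
    C = act.shape[-1]
    H = hadamard(C)
    act_rot, w_rot = act @ H, weight @ H                   # pre-global Hadamard rotation
    s = act_rot.abs().amax(dim=0).clamp(min=1e-8).sqrt()   # post-local channel smoothing
    act_s, w_s = act_rot / s, w_rot * s
    return quantize_groupwise(act_s, bits), quantize_groupwise(w_s, bits)
```

The rotation spreads the heavy-tailed activation outliers induced by the special tokens across all channels, and the per-channel factors equalize the remaining inter-channel ranges before the group-wise scales are fitted.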
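
Likewise, a rough sketch of the calibration-set construction idea behind Noise-Filtered Diverse Sampling, assuming per-frame deep-layer feature statistics have already been extracted (the z-score threshold, the k-means routine, and all names here are hypothetical, not the released code):

```python
import torch

def noise_filtered_diverse_sampling(feats: torch.Tensor, num_samples: int,
                                    z_thresh: float = 2.5, iters: int = 20) -> torch.Tensor:
    # feats: (N, D) deep-layer feature statistics for N candidate frames.
    # Returns the indices of the frames selected for calibration.

    # 1) Noise filtering: drop frames whose deep-feature norm is a statistical outlier.
    norms = feats.norm(dim=1)
    z = (norms - norms.mean()) / norms.std().clamp(min=1e-8)
    keep = torch.nonzero(z.abs() < z_thresh).squeeze(1)
    kept = feats[keep]

    # 2) Diversity: simple k-means over the surviving frames, one pick per cluster,
    #    so the calibration set covers distinct views instead of near-duplicates.
    k = min(num_samples, kept.shape[0])
    centers = kept[torch.randperm(kept.shape[0])[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(kept, centers).argmin(dim=1)
        for c in range(k):
            members = kept[assign == c]
            if members.numel() > 0:
                centers[c] = members.mean(dim=0)

    # The real frame closest to each centroid becomes a calibration sample.
    picks = torch.cdist(centers, kept).argmin(dim=1)
    return keep[picks]
```

Filtering first keeps activation-range estimates stable, while the clustering step keeps the small calibration set representative of the multi-view input distribution.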