
Quantized Visual Geometry Grounded Transformer

September 25, 2025
Authors: Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
cs.AI

Abstract

Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. However, their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models, yet we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first quantization framework for VGGTs, namely QuantVGGT, which rests on two technical contributions. First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT delivers a 3.7× memory reduction and 2.5× acceleration in real-hardware inference while maintaining reconstruction accuracy above 98% of its full-precision counterpart, demonstrating the clear advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
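To make the Dual-Smoothed Fine-Grained Quantization pipeline concrete, the sketch below gives one plausible reading of the abstract in PyTorch: a global Hadamard rotation to flatten heavy-tailed activations, a per-channel smoothing scale, and group-wise symmetric quantization. All helper names and hyperparameters here (`hadamard_matrix`, the group size, the max-based smoothing scale) are illustrative assumptions, not the authors' released implementation.

```python
import torch

def hadamard_matrix(n: int) -> torch.Tensor:
    """Orthonormal Hadamard matrix via the Sylvester construction.
    Assumes n is a power of two."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)

def dual_smoothed_quantize(x: torch.Tensor, bits: int = 4, group: int = 64):
    """x: (tokens, channels); channels must be divisible by `group`.
    Returns fake-quantized (quantize-dequantize) activations."""
    t, c = x.shape
    H = hadamard_matrix(c)
    # 1) pre-global rotation: spreads heavy-tailed outlier energy across channels
    x_rot = x @ H
    # 2) post-local channel smoothing: equalize per-channel magnitudes
    s = x_rot.abs().amax(dim=0).clamp(min=1e-5)
    x_smooth = x_rot / s
    # 3) fine-grained (group-wise) symmetric uniform quantization
    qmax = 2 ** (bits - 1) - 1
    xg = x_smooth.reshape(t, c // group, group)
    scale = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / qmax
    q = torch.clamp(torch.round(xg / scale), -qmax - 1, qmax)
    x_dq = (q * scale).reshape(t, c)
    # undo smoothing and rotation (in practice folded into adjacent weights)
    return (x_dq * s) @ H.T
```

In rotation-based PTQ methods of this kind, the rotation and smoothing factors are typically folded into neighboring linear layers, so the transform adds no runtime overhead at inference.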
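Similarly, a rough sketch of the Noise-Filtered Diverse Sampling idea, under the assumption that each candidate multi-view sample is summarized by a deep-layer feature vector: z-score outliers are filtered out, then a short k-means pass picks one representative per cluster to form a diverse calibration set. The threshold and clustering choices are hypothetical stand-ins for the paper's procedure.

```python
import torch

def select_calibration_set(feats: torch.Tensor, k: int, z_thresh: float = 2.5):
    """feats: (num_candidates, dim) deep-layer statistics, one row per
    candidate multi-view sample. Returns indices of k calibration samples."""
    # 1) noise filtering: drop candidates whose feature norm is a z-score outlier
    norms = feats.norm(dim=1)
    z = (norms - norms.mean()) / norms.std().clamp(min=1e-8)
    keep = torch.nonzero(z.abs() < z_thresh).squeeze(1)
    kept = feats[keep]
    # 2) diversity: a few k-means iterations over the surviving candidates
    centroids = kept[torch.randperm(kept.shape[0])[:k]].clone()
    for _ in range(20):
        assign = torch.cdist(kept, centroids).argmin(dim=1)   # (n,)
        for j in range(k):
            members = kept[assign == j]
            if members.shape[0] > 0:
                centroids[j] = members.mean(dim=0)
    # 3) take the real sample nearest each centroid as the calibration set
    nearest = torch.cdist(centroids, kept).argmin(dim=1)      # (k,)
    return keep[nearest]
```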