Quantized Visual Geometry Grounded Transformer
September 25, 2025
Authors: Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
cs.AI
Abstract
Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress through the use of large-scale transformers. However, their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes QuantVGGT, the first quantization framework for VGGTs, which rests on two technical contributions. First, we introduce Dual-Smoothed Fine-Grained Quantization, which combines a pre-global Hadamard rotation with post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. Notably, our 4-bit QuantVGGT delivers a 3.7× memory reduction and 2.5× acceleration in real-hardware inference while maintaining reconstruction accuracy above 98% of its full-precision counterpart, demonstrating the clear advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
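
To make the first technical contribution concrete, below is a minimal PyTorch sketch of the general idea behind Dual-Smoothed Fine-Grained Quantization as the abstract describes it: a global Hadamard rotation flattens heavy-tailed activation outliers, a SmoothQuant-style per-channel rescaling evens out inter-channel variance, and per-group uniform quantization follows. The function names, group size, smoothing exponent, and toy data are illustrative assumptions, not the released implementation.

```python
# Minimal sketch (not the authors' code) of the Dual-Smoothed Fine-Grained
# Quantization idea: pre-global Hadamard rotation + post-local channel
# smoothing + per-group uniform quantization.
import torch

def hadamard(n: int) -> torch.Tensor:
    """Normalized Hadamard matrix via Sylvester construction (n must be a power of 2)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)

def channel_smooth(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    """Shift quantization difficulty from activations to weights, per input channel."""
    act_scale = x.abs().amax(dim=0).clamp(min=1e-5)
    w_scale = w.abs().amax(dim=0).clamp(min=1e-5)
    s = (act_scale ** alpha) / (w_scale ** (1 - alpha))
    return x / s, w * s                              # (x/s) @ (w*s).T == x @ w.T

def quantize_per_group(x: torch.Tensor, bits: int = 4, group: int = 64):
    """Symmetric uniform fake-quantization with one scale per contiguous group."""
    qmax = 2 ** (bits - 1) - 1
    xg = x.reshape(-1, group)
    scale = xg.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / qmax
    xq = torch.clamp(torch.round(xg / scale), -qmax - 1, qmax) * scale
    return xq.reshape_as(x)

# Toy example: activations (tokens x channels) and a linear weight (out x in).
torch.manual_seed(0)
x = torch.randn(128, 256); x[:, 3] *= 40.0           # inject a heavy-tailed channel
w = torch.randn(512, 256)
R = hadamard(256)
x_rot, w_rot = x @ R, w @ R                          # pre-global rotation (R is orthogonal)
x_s, w_s = channel_smooth(x_rot, w_rot)              # post-local channel smoothing
y_q = quantize_per_group(x_s, bits=4) @ quantize_per_group(w_s, bits=4).T
print((y_q - x @ w.T).abs().mean())                  # quantization error vs. full precision
```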
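For the second contribution, here is a hedged sketch of the Noise-Filtered Diverse Sampling idea, assuming per-candidate statistics from a deep transformer layer are already collected: z-score filtering drops outlier candidates, and greedy farthest-point selection over the remaining frame features yields a diverse calibration set. The helper name, thresholds, and the stand-in `deep_feats` tensor are hypothetical, not the paper's procedure in detail.

```python
# Minimal sketch (assumed, not the paper's code) of Noise-Filtered Diverse
# Sampling: filter outlier candidates by deep-layer statistics, then pick a
# diverse calibration subset via greedy farthest-point selection.
import torch

def noise_filtered_diverse_sampling(deep_feats: torch.Tensor,
                                    num_calib: int = 8,
                                    z_thresh: float = 2.5) -> list[int]:
    """deep_feats: (num_candidates, feat_dim) per-sample deep-layer statistics."""
    # 1) Noise filtering: drop candidates whose feature norm is a z-score outlier.
    norms = deep_feats.norm(dim=1)
    z = (norms - norms.mean()) / norms.std().clamp(min=1e-6)
    keep = torch.nonzero(z.abs() < z_thresh).flatten().tolist()

    # 2) Diversity: greedy farthest-point sampling in feature space, so the
    #    calibration set covers distinct viewpoints / content clusters.
    feats = torch.nn.functional.normalize(deep_feats[keep], dim=1)
    chosen = [0]                                     # start from the first kept sample
    dists = 1 - feats @ feats[0]                     # cosine distance to the chosen set
    while len(chosen) < min(num_calib, len(keep)):
        nxt = int(torch.argmax(dists))
        chosen.append(nxt)
        dists = torch.minimum(dists, 1 - feats @ feats[nxt])
    return [keep[i] for i in chosen]                 # map back to original candidate ids

# Toy usage: 100 candidate frames with 64-d deep statistics, a few outliers.
torch.manual_seed(0)
feats = torch.randn(100, 64); feats[::25] *= 10.0    # inject outlier candidates
print(noise_filtered_diverse_sampling(feats, num_calib=8))
```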