Quantized Visual Geometry Grounded Transformer
September 25, 2025
Authors: Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
cs.AI
Abstract
Learning-based 3D reconstruction models, represented by Visual Geometry
Grounded Transformers (VGGTs), have made remarkable progress with the use of
large-scale transformers. However, their prohibitive computational and memory
costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has
become a common practice for compressing and accelerating models. However, we
empirically observe that PTQ faces unique obstacles when compressing
billion-scale VGGTs: the data-independent special tokens induce heavy-tailed
activation distributions, while the multi-view nature of 3D data makes
calibration sample selection highly unstable. This paper proposes the first
quantization framework for VGGTs, namely QuantVGGT, which mainly relies on two
technical contributions: First, we introduce Dual-Smoothed Fine-Grained
Quantization, which integrates pre-global Hadamard rotation and post-local
channel smoothing to mitigate heavy-tailed distributions and inter-channel
variance robustly. Second, we design Noise-Filtered Diverse Sampling, which
filters outliers via deep-layer statistics and constructs frame-aware diverse
calibration clusters to ensure stable quantization ranges. Comprehensive
experiments demonstrate that QuantVGGT achieves state-of-the-art results
across different benchmarks and bit-widths, surpassing the previous
state-of-the-art generic quantization method by a large margin. We highlight
that our 4-bit QuantVGGT can deliver a 3.7× memory reduction and
2.5× acceleration in real-hardware inference, while maintaining
reconstruction accuracy above 98% of its full-precision counterpart. This
demonstrates the vast advantages and practicality of QuantVGGT in
resource-constrained scenarios. Our code is released at
https://github.com/wlfeng0509/QuantVGGT.
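
To make the first contribution more concrete, below is a minimal sketch (not the authors' released implementation; the function names, the smoothing exponent, and the group size are illustrative assumptions) of how a pre-GEMM Hadamard rotation and per-channel smoothing can be combined before fine-grained low-bit quantization:

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    # Sylvester construction; assumes n is a power of two.
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)  # orthonormal, so it can be folded into the weights

def quantize_groupwise(x: torch.Tensor, bits: int = 4, group: int = 128) -> torch.Tensor:
    # Symmetric fake quantization with one scale per contiguous group of values.
    qmax = 2 ** (bits - 1) - 1
    xg = x.reshape(-1, group)  # assumes x.numel() is divisible by `group`
    scale = xg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (torch.round(xg / scale).clamp(-qmax - 1, qmax) * scale).reshape(x.shape)

def dual_smoothed_quant(act: torch.Tensor, weight: torch.Tensor, bits: int = 4):
    # act: (tokens, C); weight: (C_out, C). Both the rotation and the smoothing factors
    # are folded into the weight, so the smoothed GEMM equals the original one
    # before quantization noise is introduced.
    C = act.shape[-1]
    H = hadamard(C)
    act_rot, w_rot = act @ H, weight @ H                   # pre-global Hadamard rotation
    s = act_rot.abs().amax(dim=0).clamp(min=1e-8).sqrt()   # post-local channel smoothing
    act_s, w_s = act_rot / s, w_rot * s
    return quantize_groupwise(act_s, bits), quantize_groupwise(w_s, bits)
```

The rotation spreads the heavy-tailed activation outliers induced by the special tokens across all channels, and the per-channel factors equalize the remaining inter-channel ranges before the group-wise scales are fitted.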
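
Likewise, a rough sketch of the calibration-set construction idea behind Noise-Filtered Diverse Sampling, assuming per-frame deep-layer feature statistics have already been extracted (the z-score threshold, the k-means routine, and all names here are hypothetical, not the released code):

```python
import torch

def noise_filtered_diverse_sampling(feats: torch.Tensor, num_samples: int,
                                    z_thresh: float = 2.5, iters: int = 20) -> torch.Tensor:
    # feats: (N, D) deep-layer feature statistics for N candidate frames.
    # Returns the indices of the frames selected for calibration.

    # 1) Noise filtering: drop frames whose deep-feature norm is a statistical outlier.
    norms = feats.norm(dim=1)
    z = (norms - norms.mean()) / norms.std().clamp(min=1e-8)
    keep = torch.nonzero(z.abs() < z_thresh).squeeze(1)
    kept = feats[keep]

    # 2) Diversity: simple k-means over the surviving frames, one pick per cluster,
    #    so the calibration set covers distinct views instead of near-duplicates.
    k = min(num_samples, kept.shape[0])
    centers = kept[torch.randperm(kept.shape[0])[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(kept, centers).argmin(dim=1)
        for c in range(k):
            members = kept[assign == c]
            if members.numel() > 0:
                centers[c] = members.mean(dim=0)

    # The real frame closest to each centroid becomes a calibration sample.
    picks = torch.cdist(centers, kept).argmin(dim=1)
    return keep[picks]
```

Filtering first keeps activation-range estimates stable, while the clustering step keeps the small calibration set representative of the multi-view input distribution.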