Quantized Visual Geometry Grounded Transformer

Abstract

Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress through the use of large-scale transformers. However, their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. Yet we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. It relies on two main technical contributions. First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT delivers a 3.7× memory reduction and a 2.5× acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98% of its full-precision counterpart. This demonstrates the advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is available at https://github.com/wlfeng0509/QuantVGGT.
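The idea of rotating activations before smoothing them can be illustrated with a minimal pure-Python sketch. It is not the QuantVGGT implementation: the toy data, the helper names (`hadamard_rotate`, `smoothing_scales`, `step_size`), and the SmoothQuant-style scale formula with `alpha=0.5` are all assumptions made for illustration. The sketch applies an orthonormal Hadamard rotation to activations and weights, then a per-channel rescaling, and checks that the layer output is unchanged while the activation range shrinks.

```python
import math

def hadamard_rotate(vec):
    """Orthonormal Walsh-Hadamard transform; length must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in out]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def smoothing_scales(acts, weights, alpha=0.5):
    """Per-channel scales in the SmoothQuant style (assumed form, not from the paper)."""
    scales = []
    for j, w_row in enumerate(weights):
        a_max = max(abs(row[j]) for row in acts)
        w_max = max(abs(w) for w in w_row)
        if a_max > 0 and w_max > 0:
            scales.append(a_max ** alpha / w_max ** (1 - alpha))
        else:
            scales.append(1.0)  # leave degenerate channels untouched
    return scales

def step_size(matrix, bits=4):
    """Step of a symmetric per-tensor quantizer that covers max |value|."""
    amax = max(abs(v) for row in matrix for v in row)
    return amax / (2 ** (bits - 1) - 1)

# Hypothetical toy data: 2 tokens x 8 channels, channel 3 carries an outlier.
acts = [[0.1, -0.2, 0.05, 8.0, 0.1, -0.1, 0.2, 0.0],
        [-0.1, 0.1, 0.2, 7.5, -0.2, 0.0, 0.1, 0.05]]
weights = [[0.5 - 0.1 * j, 0.2 * (j % 3) - 0.3] for j in range(8)]

# Step 1 (global rotation): X H and H W, so (X H)(H W) = X W because H H = I.
acts_rot = [hadamard_rotate(row) for row in acts]
weights_rot = transpose([hadamard_rotate(col) for col in transpose(weights)])

# Step 2 (local smoothing): divide activations and multiply weights per channel.
s = smoothing_scales(acts_rot, weights_rot)
acts_s = [[v / s[j] for j, v in enumerate(row)] for row in acts_rot]
weights_s = [[w * s[j] for w in row] for j, row in enumerate(weights_rot)]

reference = matmul(acts, weights)
transformed = matmul(acts_s, weights_s)
print("4-bit step before rotation:", step_size(acts))
print("4-bit step after rotation: ", step_size(acts_rot))
```

Because the rotation is orthogonal, the outlier's energy is spread across all channels, so the per-tensor quantization step for the activations becomes smaller. The smoothing step then trades range between activations and weights channel by channel without changing their product.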
