Quantized Visual Geometry Grounded Transformer

Abstract

Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress through the use of large-scale transformers. However, their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. We empirically observe, however, that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes QuantVGGT, the first quantization framework for VGGTs. It rests on two technical contributions. First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT delivers a 3.7× memory reduction and a 2.5× speedup in real-hardware inference, while maintaining reconstruction accuracy above 98% of its full-precision counterpart. This demonstrates the advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
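The idea behind combining a global Hadamard rotation with per-channel smoothing can be illustrated on a toy linear layer. The NumPy sketch below is not the QuantVGGT implementation: the helper names (`hadamard`, `fake_quant`, `dual_smooth`), the SmoothQuant-style scaling rule with `alpha = 0.5`, the per-token and per-output-channel granularity, and the synthetic data are all assumptions made for illustration. It relies only on the fact that an orthonormal rotation and a per-channel rescaling leave the full-precision product unchanged while spreading outliers across channels before quantization.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant(x, bits, axis):
    """Symmetric uniform quantize-dequantize with one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero slices
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def rel_error(approx, ref):
    """Relative Frobenius-norm error of `approx` with respect to `ref`."""
    return np.linalg.norm(approx - ref) / np.linalg.norm(ref)

def dual_smooth(X, W, alpha=0.5):
    """Rotate globally with a Hadamard matrix, then rescale each channel (assumed rule)."""
    H = hadamard(X.shape[1])
    Xr, Wr = X @ H, H.T @ W  # (X H)(H^T W) = X W because H is orthonormal
    s = np.abs(Xr).max(axis=0) ** alpha / np.abs(Wr).max(axis=1) ** (1 - alpha)
    return Xr / s, Wr * s[:, None]  # (Xr / s)(s * Wr) = Xr Wr

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 128))   # toy activations: 64 tokens x 128 channels
X[:, 7] *= 20.0                  # one high-variance channel (synthetic)
X[0, 5] = 80.0                   # one outlier on a stand-in "special token" row
W = rng.normal(size=(128, 128))  # toy weights: 128 input x 128 output channels
Y = X @ W                        # full-precision reference output

Xs, Ws = dual_smooth(X, W)
# 4-bit activations (per token) and 4-bit weights (per output channel)
err_plain = rel_error(fake_quant(X, 4, axis=1) @ fake_quant(W, 4, axis=0), Y)
err_dual = rel_error(fake_quant(Xs, 4, axis=1) @ fake_quant(Ws, 4, axis=0), Y)
print(f"plain W4A4 error: {err_plain:.3f}, rotated + smoothed: {err_dual:.3f}")
```

On this synthetic input the rotated and smoothed variant should show a lower relative error than direct quantization, because the rotation spreads the outlier energy evenly over all channels and so narrows the range each per-token scale has to cover. The numbers say nothing about the accuracy reported for QuantVGGT itself.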
