QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
February 23, 2026
作者: Jingxuan Zhang, Yunta Hsieh, Zhongwei Wang, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, Mi Zhang
cs.AI
Abstract
Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, and delivers a 1.22x speedup in end-to-end inference latency, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
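The "folded into the dequantization scales" idea behind attention temperature matching can be illustrated with a small NumPy sketch. This is not the authors' implementation: the symmetric quantizer, the per-head temperature values `t`, and all tensor shapes below are assumptions made for the example. It only demonstrates the general mechanism that a per-head scale on the query projection's output commutes with per-column dequantization, so the temperature costs nothing extra at inference time.

```python
import numpy as np

def quantize(w, n_bits=8):
    # Symmetric per-tensor quantization: returns int8 weights and a float scale.
    s = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    q = np.round(w / s).astype(np.int8)
    return q, s

rng = np.random.default_rng(0)
n_heads, d_head, d_model = 4, 8, 32
Wq = rng.normal(size=(d_model, n_heads * d_head))  # query projection weight
x = rng.normal(size=(5, d_model))                  # a batch of token features

q_w, s_w = quantize(Wq)

# Hypothetical per-head temperatures (in the paper these would be calibrated
# offline to stabilize attention logits; the values here are illustrative).
t = np.array([0.9, 1.0, 1.1, 0.95])

# Fold the temperature into the dequant scale: each head's output columns
# get a combined scale s_w * t_h, so no separate multiply is needed at runtime.
per_col_scale = s_w * np.repeat(t, d_head)               # shape (n_heads * d_head,)
q_fused = (x @ q_w.astype(np.float32)) * per_col_scale   # int matmul, then one dequant

# Reference path: dequantize the weights first, then scale each head's logits.
q_ref = (x @ (q_w.astype(np.float32) * s_w)) * np.repeat(t, d_head)
assert np.allclose(q_fused, q_ref)
```

Because the temperature is a constant per output column, multiplying after the integer matmul is mathematically identical to scaling the dequantized weights, which is why the extra scaling disappears into the existing dequantization step.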