QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
February 23, 2026
Authors: Jingxuan Zhang, Yunta Hsieh, Zhongwei Wang, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, Mi Zhang
cs.AI
Abstract
Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, and delivers a 1.22x speedup in end-to-end inference latency, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
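To make the attention temperature matching idea concrete, the sketch below illustrates one way a per-head scale could be calibrated and then folded into the dequantization scales so inference incurs no extra operator. The function names, the second-moment matching criterion, and the choice of folding the factor into the query projection's per-channel dequant scale are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def calibrate_attention_temperature(fp_logits: torch.Tensor,
                                     q_logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical per-head temperature calibration (assumption: match the
    spread of quantized attention logits to the full-precision ones).

    fp_logits, q_logits: [batch, heads, seq, seq] attention logits collected
    on a small unlabeled calibration buffer with full-precision and quantized
    Q/K projections, respectively. Returns one scalar temperature per head.
    """
    eps = 1e-6
    fp_std = fp_logits.float().std(dim=(0, 2, 3))   # per-head spread, FP path
    q_std = q_logits.float().std(dim=(0, 2, 3))     # per-head spread, quantized path
    return fp_std / (q_std + eps)                   # shape: [heads]

def fold_temperature_into_dequant(q_dequant_scale: torch.Tensor,
                                  temperature: torch.Tensor,
                                  head_dim: int) -> torch.Tensor:
    """Fold the per-head temperature into the per-output-channel dequantization
    scale of the query projection (one possible reading of "folded into the
    dequantization scales"), so no extra multiply is needed at inference.

    q_dequant_scale: dequant scale of W_q outputs, shape [heads * head_dim].
    """
    per_channel_tau = temperature.repeat_interleave(head_dim)
    return q_dequant_scale * per_channel_tau
```

Because scaling the query activations by a per-head factor scales the attention logits by the same factor, absorbing that factor into an existing dequantization scale keeps the operator schedule unchanged, consistent with the training-free, architecture-preserving design described in the abstract.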