Shallow-π: Knowledge Distillation for Flow-based VLAs

January 28, 2026
Authors: Boseong Jeon, Yunho Choi, Taehan Kim
cs.AI

Abstract

The growing demand for real-time robotic deployment necessitates fast and on-device inference for vision-language-action (VLA) models. Within the VLA literature, efficiency has been extensively studied at the token level, such as visual token pruning. In contrast, systematic transformer layer reduction has received limited attention and, to the best of our knowledge, has not been explored for flow-based VLA models under knowledge distillation. In this work, we propose Shallow-π, a principled knowledge distillation framework that aggressively reduces the transformer depth of both the VLM backbone and the flow-based action head, compressing the model from 18 to 6 layers. Shallow-π achieves over two times faster inference with less than one percent absolute drop in success rate on standard manipulation benchmarks, establishing state-of-the-art performance among reduced VLA models. Crucially, we validate our approach through industrial-scale real-world experiments on Jetson Orin and Jetson Thor across multiple robot platforms, including humanoid systems, in complex and dynamic manipulation scenarios.
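
The page carries no code, so the following is only a minimal sketch of the general recipe the abstract describes: distilling an 18-layer teacher into a 6-layer student by supervising intermediate hidden states (a standard layer-wise distillation technique) alongside the flow-matching output of the action head. Everything here is an assumption, not the authors' implementation: the `distill_step` function, the `LAYER_MAP` spacing, the `teacher`/`student` interfaces, the loss weights, and the premise that teacher and student share the same hidden width (plausible, since the paper reduces depth rather than width).

```python
import torch
import torch.nn.functional as F

# Hypothetical uniform layer mapping: student layer i is supervised by
# teacher layer 3*i + 2 (6 student layers spanning 18 teacher layers).
LAYER_MAP = {0: 2, 1: 5, 2: 8, 3: 11, 4: 14, 5: 17}

def distill_step(teacher, student, batch, alpha=1.0, beta=1.0):
    """One distillation step. `teacher` and `student` are hypothetical VLA
    modules returning (list of per-layer hidden states, predicted flow-matching
    velocity for the action chunk); hidden widths are assumed to match."""
    with torch.no_grad():
        t_hidden, t_vel = teacher(batch)  # t_hidden: list of [B, T, D]; t_vel: [B, H, A]
    s_hidden, s_vel = student(batch)

    # Hidden-state matching: pull each student layer toward its mapped teacher layer.
    hidden_loss = sum(
        F.mse_loss(s_hidden[s], t_hidden[t]) for s, t in LAYER_MAP.items()
    ) / len(LAYER_MAP)

    # Action-head distillation: match the teacher's predicted velocity field,
    # the quantity a flow-based head integrates at inference to produce actions.
    action_loss = F.mse_loss(s_vel, t_vel)

    return alpha * hidden_loss + beta * action_loss
```

Under this sketch, the 2× inference speedup falls out directly from the depth ratio (6 vs. 18 layers at both the backbone and the action head), while the two loss terms are what would keep the success-rate drop small.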