PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models
June 21, 2026
Authors: Xianghui Wang, Feng Chen, Wenbo Zhang, Hua Yan, Zixuan Wang, Changsheng Li, Yinjie Lei
cs.AI
Abstract
Vision-Language-Action (VLA) models provide a unified paradigm for robotic manipulation, yet their real-world deployment is often bottlenecked by execution efficiency. While existing efforts predominantly focus on compute-centric efficiency to reduce per-step inference latency, the intrinsic policy efficiency of these models remains largely unexplored. Policy efficiency is fundamentally affected by two factors: the effective executable length of predicted action chunks and the total number of physical steps required to complete a task. These two factors jointly determine the total number of forward inference calls during execution. We observe that current VLA policies struggle with planning unreliability and action redundancy: their predictions degrade severely at the tail of action chunks, and they tend to generate redundant physical steps. To address this, we propose PolicyTrim, a reinforcement learning-based post-training framework that extends the reliable action chunk length and reduces redundant physical steps. For reliable chunk extension, we employ a dynamic exploration strategy that explicitly rewards the successful completion of longer executable lengths, progressively pushing the trustworthy prediction horizon to its empirical limit. For step efficiency, we design a redundancy-aware reward that directly favors successful task completions with fewer steps while penalizing unreproducible shortcuts, effectively eliminating redundant physical actions. Extensive experiments across three benchmarks and three VLA models demonstrate that PolicyTrim improves action chunk utilization by 3× and reduces physical execution steps by 51.4%. Ultimately, our framework delivers up to a 5.83× end-to-end deployment speedup without compromising task success rates.
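To make the relationship between the two factors concrete, the following is a minimal sketch of how the number of forward inference calls might be counted. It assumes a simplified model that the abstract does not spell out: every inference call yields a chunk from which a fixed number of actions is executed before the policy is queried again. The function name `inference_calls` and all numeric values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def inference_calls(total_steps: int, executable_len: int) -> int:
    """Count forward passes under a simplified, assumed execution model.

    total_steps:    physical steps needed to finish the task
    executable_len: actions actually executed from each predicted chunk
    """
    if total_steps < 0 or executable_len <= 0:
        raise ValueError("total_steps must be >= 0 and executable_len > 0")
    # Integer ceiling division: a final partial chunk still costs one call.
    return -(-total_steps // executable_len)


# Illustrative numbers only (assumptions, not taken from the paper).
baseline_calls = inference_calls(total_steps=400, executable_len=2)
trimmed_calls = inference_calls(total_steps=194, executable_len=6)

print(baseline_calls)  # 200
print(trimmed_calls)   # 33
```

Under this assumed model, lengthening the executable portion of each chunk and shortening the episode both reduce the call count, and the two effects multiply. The ratio of call counts is not the same quantity as an end-to-end speedup, since wall-clock time also includes the physical execution of each step.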