PolicyTrim: Improving the Intrinsic Policy Efficiency of Vision-Language-Action Models

Abstract

Vision-Language-Action (VLA) models provide a unified paradigm for robotic manipulation, yet their real-world deployment is often bottlenecked by execution efficiency. While existing efforts predominantly focus on compute-centric efficiency to reduce per-step inference latency, the intrinsic policy efficiency of these models remains largely unexplored. Policy efficiency is fundamentally affected by two factors, namely the effective executable length of predicted action chunks and the total physical steps required to complete a task. These two factors jointly determine the total number of forward inference calls during execution. We observe that current VLA policies struggle with planning unreliability and action redundancy, suffering from severe prediction degradation at the tail of action chunks and tending to generate unnecessarily redundant physical steps. To address this, we propose PolicyTrim, a reinforcement learning-based post-training framework that extends the reliable action chunk length and reduces redundant physical steps. For reliable chunk extension, we employ a dynamic exploration strategy that explicitly rewards the successful completion of longer executable lengths, progressively pushing the trustworthy prediction horizon to its empirical limit. For step efficiency, we design a redundancy-aware reward that directly favors successful task completions with fewer steps while penalizing unreproducible shortcuts, effectively eliminating redundant physical actions. Extensive experiments across three benchmarks and three VLA models demonstrate that PolicyTrim improves action chunk utilization by 3× and reduces physical execution steps by 51.4%. Ultimately, our framework delivers up to a 5.83× end-to-end deployment speedup without compromising task success rates.
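
The statement that executable chunk length and total physical steps jointly determine the number of forward inference calls can be illustrated with a small calculation. The sketch below assumes the simplest possible accounting, in which the policy is queried once per chunk and a fixed number of actions from each chunk is executed before the next query. The function name `inference_calls` and all numeric values are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
import math


def inference_calls(total_steps: int, executed_per_chunk: int) -> int:
    """Count forward passes needed to finish an episode.

    Assumes (for illustration only) that each forward pass yields one action
    chunk, of which exactly `executed_per_chunk` actions are executed before
    the policy is queried again.
    """
    if total_steps < 0:
        raise ValueError("total_steps must be non-negative")
    if executed_per_chunk <= 0:
        raise ValueError("executed_per_chunk must be positive")
    # The last chunk may be only partially executed, hence the ceiling.
    return math.ceil(total_steps / executed_per_chunk)


# Hypothetical episode: 300 physical steps, 5 actions executed per chunk.
baseline_calls = inference_calls(total_steps=300, executed_per_chunk=5)

# Hypothetical improved policy: half the steps, three times the executed length.
improved_calls = inference_calls(total_steps=150, executed_per_chunk=15)

print(baseline_calls, improved_calls)
```

Under this accounting the two factors multiply: fewer physical steps shrink the numerator, while a longer reliable executable length grows the denominator. Wall-clock speedup in a real deployment also depends on per-call latency and action execution time, which this sketch does not model.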