Overcoming Dynamics Blindness: Training-Free Pace-and-Path Correction for VLA Models

Abstract

Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or fine-tuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. Jointly minimizing a single quadratic cost yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset; together they absorb the perceived dynamics within the chunk window. We evaluate our approach on MoveBench, a comprehensive diagnostic benchmark designed to isolate motion as the sole controlled variable. Empirical results show that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods. Over foundational VLA models, it improves success rates by up to 28.8% and 25.9% in absolute terms in dynamic-only and static-dynamic mixed environments, respectively.
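The orthogonal split into a component along the planned direction and a component across it can be illustrated with a plain vector projection. The following is a minimal sketch and not the paper's operator: the function name `split_along_and_across`, the representation of the perceived motion as a single displacement vector, and the use of one planned direction per chunk are all assumptions made for illustration. It shows only that minimizing the squared residual of a displacement against a unit direction yields a scalar along-plan part and a remainder that is orthogonal to the plan.

```python
import math

def split_along_and_across(plan_direction, perceived_shift):
    """Split a displacement into parts along and across a planned direction.

    Hypothetical helper for illustration only. Minimizing
    ||perceived_shift - a * d||^2 over the scalar a (with d the unit
    planned direction) gives a = d . perceived_shift; the leftover
    vector is orthogonal to d.
    """
    norm = math.sqrt(sum(c * c for c in plan_direction))
    if norm == 0.0:
        raise ValueError("plan_direction must be non-zero")
    d = [c / norm for c in plan_direction]

    # Scalar component along the plan (what a pace-like channel would see).
    along = sum(di * vi for di, vi in zip(d, perceived_shift))
    # Residual orthogonal to the plan (what a path-like channel would see).
    across = [vi - along * di for di, vi in zip(d, perceived_shift)]
    return along, across

# Example: plan points along x, the perceived shift has x and y parts.
along, across = split_along_and_across([1.0, 0.0, 0.0], [0.3, 0.4, 0.0])
```

In this sketch the two outputs never interfere with each other, because `across` has zero dot product with the planned direction by construction. How the actual method maps such components to a timing compression and a spatial offset is not specified in the abstract.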