Overcoming the Dynamic Blind Spot: Training-Free Pace-and-Path Correction for VLA Models

Abstract

Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or fine-tuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. Jointly minimizing a single quadratic cost yields a unified solution that decomposes orthogonally into two distinct channels: the pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset. Together, the two channels absorb the dynamics perceived within the chunk window. We evaluate our approach on MoveBench, a comprehensive diagnostic benchmark designed to isolate motion as the sole controlled variable. Empirical results show that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods, improving success rates over base VLA models by up to 28.8% and 25.9% in absolute terms in dynamic-only and static-dynamic mixed environments, respectively.
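
As a rough illustration of the orthogonal split described above, the sketch below decomposes a perceived displacement into a component along the planned direction (loosely analogous to the pace channel) and a perpendicular remainder (loosely analogous to the path channel). This is not the paper's operator: the abstract does not state the quadratic cost, so the code uses a plain least-squares projection as a stand-in, and the names `decompose_displacement`, `plan_dir`, and `target_shift` are hypothetical.

```python
def decompose_displacement(plan_dir, displacement):
    """Split `displacement` into parts along and orthogonal to `plan_dir`.

    Assumption: a generic least-squares projection, used only to illustrate
    an orthogonal two-channel split. It is not the paper's closed-form solution.
    """
    norm_sq = sum(d * d for d in plan_dir)
    if norm_sq == 0.0:
        raise ValueError("plan_dir must be non-zero")
    # Coefficient minimizing ||displacement - a * plan_dir||^2 over scalar a.
    along = sum(x * d for x, d in zip(displacement, plan_dir)) / norm_sq
    # Component along the planned direction (pace-like).
    parallel = [along * d for d in plan_dir]
    # Remainder, orthogonal to the planned direction (path-like).
    orthogonal = [x - p for x, p in zip(displacement, parallel)]
    return along, parallel, orthogonal


# Hypothetical example: the plan moves along +x while the target drifts in x and y.
plan_dir = [1.0, 0.0, 0.0]
target_shift = [0.03, 0.04, 0.0]
along, parallel, orthogonal = decompose_displacement(plan_dir, target_shift)
```

Because the two components are orthogonal, each can be adjusted without affecting the other, which is the property the abstract attributes to the pace and path channels.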