Overcoming the Dynamics Blind Spot: Training-Free Pace-and-Path Correction for VLA Models

Abstract

Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or fine-tuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. Jointly minimizing a single quadratic cost yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset; together they absorb the perceived dynamics within the chunk window. We evaluate our approach on MoveBench, a comprehensive diagnostic benchmark designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods. Relative to foundational VLA models, it improves success rates by up to 28.8% in dynamic-only environments and by up to 25.9% in static-dynamic mixed environments, both in absolute terms.
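
The orthogonal split described above can be illustrated with a small geometric sketch. This is not the operator proposed in the paper: the abstract does not state the quadratic cost or its closed-form solution, so the code below only shows how a perceived drift vector separates uniquely into a component along the planned direction (a stand-in for the pace channel) and an orthogonal remainder (a stand-in for the path channel). The function names `split_drift` and `correct_chunk`, the straight-line chunk, the linear ramp, and the use of a simple shift in place of execution compression are all assumptions made for illustration.

```python
import math


def split_drift(direction, drift):
    """Split a perceived drift vector into a scalar component along the
    planned direction and a remainder orthogonal to it.

    Hypothetical helper: the paper's actual cost and solution are not
    given in the abstract.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("planned direction must be non-zero")
    unit = [c / norm for c in direction]
    # drift = pace * unit + path, with path orthogonal to unit, is unique.
    pace = sum(u * c for u, c in zip(unit, drift))
    path = [c - pace * u for u, c in zip(unit, drift)]
    return unit, pace, path


def correct_chunk(chunk, drift):
    """Shift a chunk of waypoints by both components, ramped linearly so
    the full correction is reached at the last waypoint (an assumption)."""
    if len(chunk) < 2:
        raise ValueError("chunk needs at least two waypoints")
    # Planned direction is taken from the chunk's first and last waypoints.
    direction = [end - start for start, end in zip(chunk[0], chunk[-1])]
    unit, pace, path = split_drift(direction, drift)
    last = len(chunk) - 1
    corrected = []
    for k, point in enumerate(chunk):
        ramp = k / last
        corrected.append(
            [p + ramp * (pace * u + o) for p, u, o in zip(point, unit, path)]
        )
    return corrected
```

For example, calling `correct_chunk([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]], [0.5, 0.2])` leaves the first waypoint unchanged and moves the last one by the full drift, with the along-track and cross-track parts computed separately. In a real system the along-track part would presumably alter timing rather than position, as the abstract's description of the pace channel suggests.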