AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

Abstract

We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. Existing Vision-Language-Action (VLA) models and diffusion policies reset their temporal context with each new observation and predict actions reactively; in contrast, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning. It enables efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, and it naturally ensures spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we use a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks show that the proposed method can effectively replace traditional chunk-based action heads in both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while matching or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies. Code and videos are available at https://arvla.insait.ai