AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

Abstract

We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. Existing Vision-Language-Action (VLA) models and diffusion policies reset their temporal context with each new observation and predict actions reactively. Our Action Expert instead maintains its own history through a long-lived memory and is therefore inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning. It enables efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, and it naturally yields spatio-temporally consistent action generation across frames. To synchronize these asynchronous, hybrid vision-language-action modalities, we use a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks show that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while matching or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies. Code and videos are available at https://arvla.insait.ai.
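As a rough illustration of the control pattern the abstract describes, the following toy sketch runs a fast action loop that keeps its own history while a slow perception prefix is refreshed only occasionally and arrives with some delay. It is not the paper's model: `Prefix`, `ToyActionExpert`, `run`, the smoothing rule, and the parameters `refresh_every` and `latency` are all hypothetical names and choices made for this example, and "staleness" here is simply the number of control steps elapsed since the prefix was captured, standing in for whatever quantity the actual re-anchoring mechanism uses.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Prefix:
    """Hypothetical stand-in for a vision-language prefix."""
    features: float   # toy scalar in place of a real embedding
    captured_at: int  # control step at which the observation was taken


@dataclass
class ToyActionExpert:
    """Toy action generator that never resets its own action history."""
    history: deque = field(default_factory=lambda: deque(maxlen=64))

    def step(self, prefix: Prefix, now: int):
        # Assumption: staleness is measured in elapsed control steps.
        staleness = now - prefix.captured_at
        prev = self.history[-1] if self.history else 0.0
        # Toy causal rule: the next action depends on the previous action
        # and on the (possibly stale) prefix, not on the prefix alone.
        action = prev + 0.2 * (prefix.features - prev)
        self.history.append(action)
        return action, staleness


def run(num_steps: int = 12, refresh_every: int = 4, latency: int = 2):
    """Fast control loop with a slow, delayed perception refresh."""
    expert = ToyActionExpert()
    prefix = Prefix(features=1.0, captured_at=0)
    log = []
    for t in range(num_steps):
        if t > 0 and t % refresh_every == 0:
            # The refreshed prefix describes the scene `latency` steps ago.
            prefix = Prefix(features=1.0, captured_at=t - latency)
        log.append(expert.step(prefix, t))
    return expert, log


if __name__ == "__main__":
    _, log = run()
    for t, (action, staleness) in enumerate(log):
        print(f"t={t:2d} action={action:.3f} staleness={staleness}")
```

In this sketch a prefix refresh changes only the conditioning input and the staleness value; the action history carries over, so consecutive actions stay close to each other across refresh boundaries. A reactive policy in the sense of the abstract would instead discard that history at every new observation.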