
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

May 11, 2026
Authors: Yutong Hu, Jan-Nico Zaech, Nikolay Nikolov, Yuanqi Yao, Sombit Dey, Giuliano Albanese, Renaud Detry, Luc Van Gool, Danda Paudel
cs.AI

Abstract

We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies, which reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning. It enables efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, and it naturally ensures spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid vision-language-action modalities, we use a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies. Code and videos are available at https://arvla.insait.ai
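
The frequency mismatch described in the abstract can be illustrated with a toy control loop. The sketch below is not the authors' model or their re-anchoring mechanism: the class `ToyActionExpert`, the function `run_control_loop`, the scalar "prefix", and the smoothing rule are all hypothetical stand-ins chosen for illustration. It only shows the general pattern of a fast action loop that keeps its own history while a slower perception signal is swapped in periodically, and it tracks how stale that signal is at each control step.

```python
from collections import deque


class ToyActionExpert:
    """Hypothetical stand-in for an autoregressive action expert (illustration only)."""

    def __init__(self, history_len=8, smoothing=0.8):
        # Action history persists across prefix refreshes; it is never reset.
        self.history = deque(maxlen=history_len)
        self.smoothing = smoothing
        self.prefix = 0.0      # scalar stand-in for a vision-language prefix
        self.prefix_step = 0   # control step at which the prefix was observed

    def refresh_prefix(self, prefix, step):
        # Swap in a newer prefix while deliberately keeping the action history.
        self.prefix = prefix
        self.prefix_step = step

    def next_action(self, step):
        # Staleness: number of control steps since the prefix was observed.
        staleness = step - self.prefix_step
        previous = self.history[-1] if self.history else 0.0
        # The next action depends causally on the previous action and on the prefix.
        action = self.smoothing * previous + (1.0 - self.smoothing) * self.prefix
        self.history.append(action)
        return action, staleness


def run_control_loop(targets, refresh_every=4, num_steps=12):
    """Run fast control steps; refresh the slow prefix every `refresh_every` steps."""
    expert = ToyActionExpert()
    trace = []
    for step in range(num_steps):
        if step % refresh_every == 0:
            expert.refresh_prefix(targets[step // refresh_every], step)
        trace.append(expert.next_action(step))
    return trace


if __name__ == "__main__":
    for action, staleness in run_control_loop([1.0, 1.0, 0.0]):
        print(f"action={action:.3f} staleness={staleness}")
```

Because the history survives each refresh, the action in this toy changes gradually when the prefix changes rather than jumping, which is the qualitative behavior the abstract attributes to maintaining temporal context. A real system would use the staleness value to correct the conditioning, which this sketch does not attempt.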