AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

Abstract

We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. Unlike existing Vision-Language-Action (VLA) models and diffusion policies, which reset their temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning. It enables efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, and it naturally ensures spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we use a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks show that the proposed method can effectively replace traditional chunk-based action heads in both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while matching or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies. Code and videos are available at https://arvla.insait.ai.
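
To make the frequency mismatch concrete, the following toy Python sketch runs a fast control loop against a slower, delayed perception refresh. It is an illustration under stated assumptions, not the AR-VLA implementation: `ToyActionExpert`, `run_control_loop`, the scalar "prefix", the smoothing rule, and the period and latency values are all hypothetical. The sketch only shows that a refreshed prefix replaces the conditioning without clearing the action history, and it tracks how stale the prefix is at each step. It does not implement the re-anchoring mechanism, whose formulation is not given in the abstract.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyActionExpert:
    """Hypothetical stand-in for an autoregressive action expert."""
    # Long-lived memory of past actions; never cleared when the prefix changes.
    history: deque = field(default_factory=lambda: deque(maxlen=64))
    # Stand-in for a vision-language prefix (here just a scalar target).
    prefix: float = 0.0
    # Control step at which the observation behind the prefix was captured.
    prefix_step: int = 0

    def refresh_prefix(self, prefix: float, captured_at_step: int) -> None:
        # Swap the conditioning signal but keep the action history intact.
        self.prefix = prefix
        self.prefix_step = captured_at_step

    def step(self, t: int) -> tuple[float, int]:
        # Staleness: control steps elapsed since the prefix was observed.
        staleness = t - self.prefix_step
        # The next action depends on the previous action (causal, AR-style)
        # and on the current prefix. The 0.5 gain is an arbitrary toy choice.
        prev = self.history[-1] if self.history else 0.0
        action = prev + 0.5 * (self.prefix - prev)
        self.history.append(action)
        return action, staleness


def run_control_loop(num_steps: int = 10, perception_period: int = 4,
                     latency: int = 2):
    """Fast control loop with a slow, delayed perception refresh (toy)."""
    expert = ToyActionExpert()
    log = []
    for t in range(num_steps):
        if t % perception_period == 0:
            # A new prefix arrives, but it describes the scene `latency`
            # steps ago (clamped at step 0).
            expert.refresh_prefix(prefix=1.0,
                                  captured_at_step=max(t - latency, 0))
        action, staleness = expert.step(t)
        log.append((t, action, staleness))
    return expert, log


if __name__ == "__main__":
    _, log = run_control_loop()
    for t, action, staleness in log:
        print(f"t={t} action={action:.4f} staleness={staleness}")
```

In this toy setup the action sequence stays continuous across prefix refreshes because each action is computed from the previous one. A chunk-based head that resets its context at every observation would have no such carry-over.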