Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

Abstract

Current Vision-Language-Action (VLA) models rely on a fixed computational depth, spending the same amount of compute on simple adjustments as on complex multi-step manipulation. Chain-of-Thought (CoT) prompting enables variable computation, but its memory use grows linearly and it is ill-suited to continuous action spaces. We introduce Recurrent-Depth VLA (RD-VLA), an architecture that achieves computational adaptivity through latent iterative refinement rather than explicit token generation. RD-VLA employs a recurrent, weight-tied action head that supports arbitrary inference depth with a constant memory footprint. The model is trained with truncated backpropagation through time (TBPTT) to supervise the refinement process efficiently. At inference, RD-VLA allocates compute dynamically using an adaptive stopping criterion based on latent convergence. Experiments on challenging manipulation tasks show that recurrent depth is critical: tasks that fail entirely (0% success) with single-iteration inference exceed 90% success with four iterations, while simpler tasks saturate rapidly. By replacing token-based reasoning with latent reasoning, RD-VLA provides a scalable path to test-time compute in robotics, with constant memory usage and up to 80x faster inference than prior reasoning-based VLA models. Project page: https://rd-vla.github.io/
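The inference loop described above, a weight-tied refinement step applied repeatedly until the latent stops changing, can be illustrated with a minimal sketch. This is not the RD-VLA implementation: the names `refine_latent` and `make_toy_step`, the L2-distance halting rule, the tolerance, and the toy contraction standing in for the learned action head are all assumptions made for illustration only.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def refine_latent(
    step: Callable[[Vector], Vector],
    z0: Vector,
    max_iters: int = 32,
    tol: float = 1e-3,
) -> Tuple[Vector, int]:
    """Apply the same (weight-tied) step until the latent converges.

    Only the current latent is kept, so memory does not grow with the
    number of iterations. Halting on the L2 change between successive
    latents is an assumed criterion, not the paper's exact rule.
    """
    z = list(z0)
    for k in range(1, max_iters + 1):
        z_next = step(z)
        delta = l2_distance(z_next, z)
        z = z_next
        if delta < tol:
            return z, k  # converged: stop spending compute
    return z, max_iters  # iteration budget exhausted


def make_toy_step(target: Vector, rate: float = 0.5) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for a learned head: a contraction toward `target`."""
    def step(z: Vector) -> Vector:
        return [zi + rate * (ti - zi) for zi, ti in zip(z, target)]
    return step


if __name__ == "__main__":
    step = make_toy_step([1.0, -2.0])
    # A latent far from the fixed point needs more iterations than one near it.
    _, hard_iters = refine_latent(step, [0.0, 0.0])
    _, easy_iters = refine_latent(step, [0.99, -1.99])
    print(hard_iters, easy_iters)
```

In this toy setting the starting point that lies farther from the fixed point consumes more iterations before halting, which mirrors the abstract's claim that harder tasks benefit from more recurrent depth while simpler ones saturate quickly.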
