

Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

February 8, 2026
Authors: Yalcin Tur, Jalal Naghiyev, Haoquan Fang, Wei-Chuan Tsai, Jiafei Duan, Dieter Fox, Ranjay Krishna
cs.AI

Abstract

Current Vision-Language-Action (VLA) models rely on fixed computational depth, expending the same amount of compute on simple adjustments and complex multi-step manipulation. While Chain-of-Thought (CoT) prompting enables variable computation, it scales memory linearly and is ill-suited for continuous action spaces. We introduce Recurrent-Depth VLA (RD-VLA), an architecture that achieves computational adaptivity via latent iterative refinement rather than explicit token generation. RD-VLA employs a recurrent, weight-tied action head that supports arbitrary inference depth with a constant memory footprint. The model is trained using truncated backpropagation through time (TBPTT) to efficiently supervise the refinement process. At inference, RD-VLA dynamically allocates compute using an adaptive stopping criterion based on latent convergence. Experiments on challenging manipulation tasks show that recurrent depth is critical: tasks that fail entirely (0 percent success) with single-iteration inference exceed 90 percent success with four iterations, while simpler tasks saturate rapidly. RD-VLA provides a scalable path to test-time compute in robotics, replacing token-based reasoning with latent reasoning to achieve constant memory usage and up to 80x inference speedup over prior reasoning-based VLA models. Project page: https://rd-vla.github.io/
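The core mechanism described in the abstract, a weight-tied recurrent action head that iteratively refines a latent action state and stops adaptively when the latent converges, can be illustrated with a minimal sketch. This is not the authors' implementation: the module names, latent and action dimensions, iteration budget, and convergence threshold `tol` below are all assumptions, and the TBPTT training procedure is not shown.

```python
import torch
import torch.nn as nn


class RecurrentActionHead(nn.Module):
    """Illustrative weight-tied recurrent action head (not the official RD-VLA code).

    A single shared refinement block is applied repeatedly, so memory stays
    constant no matter how many refinement iterations run at inference time.
    """

    def __init__(self, latent_dim: int = 512, action_dim: int = 7):
        super().__init__()
        # One block reused at every iteration (weight tying).
        self.refine = nn.Sequential(
            nn.Linear(2 * latent_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        self.to_action = nn.Linear(latent_dim, action_dim)

    def forward(self, vlm_features: torch.Tensor, max_iters: int = 8,
                tol: float = 1e-3) -> torch.Tensor:
        """Refine a latent action state conditioned on fixed VLM features.

        Stops early once the latent update falls below `tol`, a stand-in for
        the paper's latent-convergence stopping criterion.
        """
        z = torch.zeros_like(vlm_features)            # initial latent action state
        for _ in range(max_iters):
            z_next = self.refine(torch.cat([z, vlm_features], dim=-1))
            delta = (z_next - z).norm(dim=-1).mean()  # size of this latent update
            z = z_next
            if delta < tol:                           # adaptive stopping
                break
        return self.to_action(z)


# Usage sketch: simple observations tend to converge after few iterations,
# harder ones consume more of the budget, with no growth in memory.
features = torch.randn(1, 512)   # placeholder for features from the VLM backbone
head = RecurrentActionHead()
action = head(features, max_iters=8)
```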