
Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

February 8, 2026
Authors: Yalcin Tur, Jalal Naghiyev, Haoquan Fang, Wei-Chuan Tsai, Jiafei Duan, Dieter Fox, Ranjay Krishna
cs.AI

Abstract

Current Vision-Language-Action (VLA) models rely on a fixed computational depth, expending the same amount of compute on simple adjustments as on complex multi-step manipulation. While Chain-of-Thought (CoT) prompting enables variable computation, its memory cost grows linearly with generated tokens and it is ill-suited to continuous action spaces. We introduce Recurrent-Depth VLA (RD-VLA), an architecture that achieves computational adaptivity via latent iterative refinement rather than explicit token generation. RD-VLA employs a recurrent, weight-tied action head that supports arbitrary inference depth with a constant memory footprint. The model is trained with truncated backpropagation through time (TBPTT) to supervise the refinement process efficiently. At inference, RD-VLA dynamically allocates compute using an adaptive stopping criterion based on latent convergence. Experiments on challenging manipulation tasks show that recurrent depth is critical: tasks that fail entirely (0% success) with single-iteration inference exceed 90% success with four iterations, while simpler tasks saturate rapidly. RD-VLA provides a scalable path to test-time compute in robotics, replacing token-based reasoning with latent reasoning to achieve constant memory usage and up to 80x inference speedup over prior reasoning-based VLA models. Project page: https://rd-vla.github.io/
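The abstract names three mechanisms: a weight-tied recurrent action head, TBPTT training, and an adaptive stopping rule based on latent convergence. The PyTorch sketch below illustrates how such pieces could fit together; the module structure, dimensions, residual update, convergence tolerance, and the TBPTT `window` size are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of latent iterative refinement with a weight-tied head.
# All names, shapes, and thresholds are assumptions for illustration only.
import torch
import torch.nn as nn


class RecurrentActionHead(nn.Module):
    """One refinement block whose weights are reused at every iteration."""

    def __init__(self, ctx_dim: int, latent_dim: int, action_dim: int):
        super().__init__()
        # A single shared block: every iteration reuses these parameters,
        # so parameter memory is constant regardless of inference depth.
        self.step = nn.Sequential(
            nn.Linear(ctx_dim + latent_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        self.decode = nn.Linear(latent_dim, action_dim)

    def forward(self, context, z):
        # One refinement iteration: residual update of the latent z,
        # conditioned on the (fixed) vision-language context features.
        return z + self.step(torch.cat([context, z], dim=-1))


@torch.no_grad()
def adaptive_inference(head, context, latent_dim, max_iters=8, tol=1e-3):
    """Iterate until the latent stops moving, or until a depth cap."""
    z = torch.zeros(context.shape[0], latent_dim)
    for t in range(max_iters):
        z_next = head(context, z)
        # Stopping criterion: relative change of the latent between steps.
        delta = (z_next - z).norm(dim=-1) / (z.norm(dim=-1) + 1e-6)
        z = z_next
        if delta.max() < tol:  # all samples converged -> stop early
            break
    return head.decode(z), t + 1


def tbptt_step(head, context, target_actions, loss_fn, optimizer,
               total_iters=8, window=2):
    """Truncated BPTT: run many refinement iterations, but only keep
    gradients for the last `window` of them."""
    z = torch.zeros(context.shape[0], head.decode.in_features)
    # Burn-in iterations without gradient tracking.
    with torch.no_grad():
        for _ in range(total_iters - window):
            z = head(context, z)
    z = z.detach()
    # Supervised window: backprop only through the final iterations.
    for _ in range(window):
        z = head(context, z)
    loss = loss_fn(head.decode(z), target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Hypothetical usage with stand-in dimensions and random features.
head = RecurrentActionHead(ctx_dim=512, latent_dim=256, action_dim=7)
ctx = torch.randn(4, 512)  # stand-in for frozen VLM context features
actions, depth_used = adaptive_inference(head, ctx, latent_dim=256)
```

This sketch also shows why the memory footprint stays constant: the shared weights mean depth adds no parameters, and TBPTT detaches all but the final `window` iterations, so activation memory during training is bounded by the window size rather than the total refinement depth.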