VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
May 28, 2026
Authors: Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang
cs.AI
Abstract
Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, and autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8x speedup.
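
The speedup figure can be checked directly from the two step latencies quoted in the abstract. The sketch below does that arithmetic and also converts each latency into an upper bound on closed-loop control rate. The helper names `speedup` and `control_rate_hz` are illustrative, not from the paper, and the rate bound assumes policy inference is the only per-step cost, which the abstract does not state.

```python
# Step latencies reported in the abstract for BridgeData V2 (seconds per step).
ECOT_LATENCY_S = 8.377
VISUALTHINK_LATENCY_S = 0.367


def speedup(baseline_s: float, new_s: float) -> float:
    """Ratio of baseline step latency to the new step latency."""
    return baseline_s / new_s


def control_rate_hz(step_latency_s: float) -> float:
    """Upper bound on control rate, ASSUMING inference is the only per-step cost."""
    return 1.0 / step_latency_s


if __name__ == "__main__":
    print(f"speedup: {speedup(ECOT_LATENCY_S, VISUALTHINK_LATENCY_S):.1f}x")
    print(f"ECoT rate bound: {control_rate_hz(ECOT_LATENCY_S):.2f} Hz")
    print(f"VISUALTHINK-VLA rate bound: {control_rate_hz(VISUALTHINK_LATENCY_S):.2f} Hz")
```

Under that assumption, the reported latencies correspond to roughly one action every eight seconds for ECoT versus between two and three actions per second for VISUALTHINK-VLA, which is the gap the abstract describes as multi-second versus sub-second.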