VisualThink-VLA: Visual Intermediate Reasoning for Efficient, Low-Latency Vision-Language-Action Policies

Abstract

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: information that is irrelevant to action prediction, or only weakly tied to text, can interfere with the decision, and autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our guiding philosophy is to bootstrap action generation with effective visual thinking: VISUALTHINK-VLA guides action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VISUALTHINK-VLA learns the visual evidence tokens with a tailored selective routing mechanism, which enables low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on VisualEvidence-Agent, which constructs VisualEvidence-Set, a dataset of 754.7k VLA instructions used for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces per-step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
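The reported speedup follows directly from the two per-step latencies quoted for BridgeData V2. The short sketch below reproduces that arithmetic; the function and constant names are illustrative and not taken from the paper, and the steps-per-second figures are simply the reciprocals of the quoted latencies, not separately reported measurements.

```python
def speedup(baseline_s: float, new_s: float) -> float:
    """Return how many times faster the new latency is than the baseline."""
    if baseline_s <= 0 or new_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_s / new_s


# Per-step latencies on BridgeData V2, as quoted in the abstract (seconds).
ECOT_STEP_LATENCY_S = 8.377
VISUALTHINK_STEP_LATENCY_S = 0.367

ratio = speedup(ECOT_STEP_LATENCY_S, VISUALTHINK_STEP_LATENCY_S)
print(f"speedup: {ratio:.1f}x")  # speedup: 22.8x

# Implied closed-loop rate (derived here as 1 / latency, an assumption that
# each control step waits for exactly one inference call).
for name, latency in [("ECoT", ECOT_STEP_LATENCY_S),
                      ("VISUALTHINK-VLA", VISUALTHINK_STEP_LATENCY_S)]:
    print(f"{name}: {1.0 / latency:.2f} steps/s")
```

Under that one-inference-per-step assumption, the sub-second latency corresponds to a few control steps per second, against roughly one step every eight seconds for the text-reasoning baseline.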