VisualThink-VLA: Visual Intermediate Reasoning for Efficient, Low-Latency Vision-Language-Action Policies

Abstract

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: textual information that is irrelevant or only weakly relevant to action prediction can interfere with the prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our guiding principle is to support action with effective visual thinking: VISUALTHINK-VLA guides action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. In addition, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions used for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces per-step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
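
As a quick check on the BridgeData V2 figures quoted above, the following minimal sketch recomputes the speedup from the two reported per-step latencies. It also converts each latency into an approximate closed-loop control rate, under the assumption (ours, not stated in the abstract) that one action is issued per inference step and that inference dominates the loop time. The function and constant names are illustrative only.

```python
def speedup(baseline_s: float, faster_s: float) -> float:
    """Ratio of the baseline step latency to the faster step latency."""
    if baseline_s <= 0 or faster_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_s / faster_s


def control_rate_hz(step_latency_s: float) -> float:
    """Approximate control rate, ASSUMING one action per inference step
    and no other overhead in the control loop (an illustrative assumption)."""
    if step_latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / step_latency_s


# Per-step latencies on BridgeData V2, as reported in the abstract.
ECOT_LATENCY_S = 8.377
VISUALTHINK_LATENCY_S = 0.367

ratio = speedup(ECOT_LATENCY_S, VISUALTHINK_LATENCY_S)
print(f"speedup: {ratio:.1f}x")  # speedup: 22.8x
print(f"ECoT rate: {control_rate_hz(ECOT_LATENCY_S):.2f} Hz")  # ECoT rate: 0.12 Hz
print(f"VisualThink-VLA rate: {control_rate_hz(VISUALTHINK_LATENCY_S):.2f} Hz")  # VisualThink-VLA rate: 2.72 Hz
```

Under that assumption, the sub-second latency corresponds to a few actions per second, compared with roughly one action every eight seconds for the text-decoding baseline.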