VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

May 28, 2026
Authors: Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang
cs.AI

Abstract

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: textual information that is irrelevant or only weakly related to action prediction can interfere with that prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our guiding philosophy is to bootstrap action with effective visual thinking: VISUALTHINK-VLA guides action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. In addition, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions used for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces per-step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
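
As a quick sanity check on the latency figures quoted in the abstract, the short Python sketch below recomputes the speedup from the two per-step latencies. The helper names `speedup` and `max_control_rate_hz` are hypothetical and are not taken from the paper. The control-rate figure is only an upper bound under the assumption that model inference is the sole per-step cost, which the abstract does not claim.

```python
def speedup(baseline_s: float, new_s: float) -> float:
    """Return how many times faster new_s is than baseline_s."""
    if baseline_s <= 0 or new_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_s / new_s


def max_control_rate_hz(step_latency_s: float) -> float:
    """Upper bound on closed-loop steps per second.

    Assumption (not from the paper): inference is the only per-step cost.
    """
    if step_latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / step_latency_s


# Per-step latencies on BridgeData V2, as quoted in the abstract.
ECOT_LATENCY_S = 8.377
VISUALTHINK_LATENCY_S = 0.367

ratio = speedup(ECOT_LATENCY_S, VISUALTHINK_LATENCY_S)
print(f"speedup: {ratio:.1f}x")
print(f"ECoT upper-bound rate: {max_control_rate_hz(ECOT_LATENCY_S):.2f} steps/s")
print(f"VISUALTHINK-VLA upper-bound rate: {max_control_rate_hz(VISUALTHINK_LATENCY_S):.2f} steps/s")
```

The ratio rounds to the 22.8× reported in the abstract. Under the stated assumption, the sub-second latency corresponds to a few action steps per second, compared with roughly one step every eight seconds for the baseline.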