VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Abstract

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought (CoT) is a poor fit: irrelevant or weakly relevant textual information can interfere with action prediction, and autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our guiding principle is to steer action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VISUALTHINK-VLA uses a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on VisualEvidence-Agent, which constructs VisualEvidence-Set, a collection of 754.7k VLA instructions, for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
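The reported speedup follows directly from the two step latencies quoted for BridgeData V2. A minimal sanity check is sketched below; the helper name `speedup` and the constant names are illustrative and do not come from the paper.

```python
def speedup(baseline_s: float, ours_s: float) -> float:
    """Return how many times faster `ours_s` is than `baseline_s`."""
    if baseline_s <= 0 or ours_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_s / ours_s


# Step latencies on BridgeData V2 as stated in the abstract (seconds).
ECOT_LATENCY_S = 8.377
VISUALTHINK_LATENCY_S = 0.367

ratio = speedup(ECOT_LATENCY_S, VISUALTHINK_LATENCY_S)
print(f"{ratio:.1f}x")  # prints "22.8x"
```

Read as a rate, and assuming one policy step per latency interval, 0.367 s per step corresponds to roughly 2.7 steps per second, compared with about 0.12 steps per second at 8.377 s.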