VisualThink-VLA: Visual Intermediate Reasoning for Effective, Low-Latency Vision-Language-Action Policies

Abstract

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, and autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces per-step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
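The BridgeData V2 latency figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `speedup` and `control_rate_hz` are hypothetical, and the control-rate conversion assumes one action per step with no pipelining or action chunking, which the abstract does not state.

```python
def speedup(baseline_s: float, improved_s: float) -> float:
    """Ratio of baseline latency to improved latency (both in seconds)."""
    if baseline_s <= 0 or improved_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_s / improved_s


def control_rate_hz(step_latency_s: float) -> float:
    """Upper bound on closed-loop control rate.

    Assumption (not from the paper): exactly one action is emitted per
    step and steps are not overlapped, so rate = 1 / latency.
    """
    if step_latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / step_latency_s


# Per-step latencies reported in the abstract for BridgeData V2.
ecot_latency_s = 8.377
visualthink_latency_s = 0.367

print(f"speedup: {speedup(ecot_latency_s, visualthink_latency_s):.1f}x")
print(f"ECoT rate: {control_rate_hz(ecot_latency_s):.2f} Hz")
print(f"VISUALTHINK-VLA rate: {control_rate_hz(visualthink_latency_s):.2f} Hz")
```

Under that assumption, the ratio reproduces the reported 22.8× figure, and the sub-second latency corresponds to a few control steps per second rather than roughly one step every eight seconds.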