Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
January 14, 2026
Authors: Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang
cs.AI
Abstract
Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher model, driven by a preference-guided objective that aligns manipulation trajectories and transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% lower inference latency than state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
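To make the abstract's two training signals concrete, the sketch below illustrates one plausible reading of the objective: a latent-CoT distillation term that pulls the student's compact latent plan toward a teacher's verbalized reasoning, plus a preference-guided term that favors the plan whose decoded trajectory better matches a demonstration. This is a minimal illustration only; the tensor names, shapes, pooling, loss choices (cosine distillation, Bradley-Terry-style preference margin), and weighting are assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of a latent-CoT distillation + preference-guided
# trajectory-alignment objective. All names, shapes, and loss choices are
# illustrative stand-ins, not the authors' implementation.
import torch
import torch.nn.functional as F

batch, n_latent, n_teacher, dim = 4, 8, 64, 256

# Stand-ins for model outputs (in practice produced by the VLA backbone).
student_latent_cot = torch.randn(batch, n_latent, dim, requires_grad=True)  # compact latent plan
teacher_cot_embed = torch.randn(batch, n_teacher, dim)                      # teacher's verbalized CoT
traj_pred_preferred = torch.randn(batch, 16, 7, requires_grad=True)         # actions from preferred plan
traj_pred_rejected = torch.randn(batch, 16, 7, requires_grad=True)          # actions from rejected plan
traj_expert = torch.randn(batch, 16, 7)                                     # demonstration trajectory

# (1) Latent-CoT distillation: pull the pooled latent plan toward the pooled
#     teacher reasoning representation (cosine distance as a simple choice).
distill_loss = 1.0 - F.cosine_similarity(
    student_latent_cot.mean(dim=1), teacher_cot_embed.mean(dim=1), dim=-1
).mean()

# (2) Preference-guided trajectory alignment: the plan whose decoded actions
#     better match the expert trajectory should be preferred (logistic margin
#     on the difference of per-sample alignment errors).
err_pref = F.mse_loss(traj_pred_preferred, traj_expert, reduction="none").mean(dim=(1, 2))
err_rej = F.mse_loss(traj_pred_rejected, traj_expert, reduction="none").mean(dim=(1, 2))
preference_loss = -F.logsigmoid(err_rej - err_pref).mean()

loss = distill_loss + 0.5 * preference_loss  # weighting is a free hyperparameter here
loss.backward()
print(f"distill={distill_loss.item():.3f}  preference={preference_loss.item():.3f}")
```

Because the reasoning is kept in a small number of latent tokens rather than a full verbalized trace, the action head conditions on a short plan at inference time, which is the source of the latency reduction claimed above.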