

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

January 14, 2026
Authors: Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang
cs.AI

Abstract

Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective that aligns manipulation trajectories and transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
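To make the abstract's two-part training signal concrete, the sketch below illustrates one plausible reading: a distillation term that pulls a small set of latent planning tokens toward a teacher's CoT embeddings, plus a preference-guided term that favors the teacher-aligned manipulation trajectory over a rejected one. This is not the authors' implementation; all tensor shapes, module choices, and loss weights are hypothetical placeholders.

```python
# Minimal PyTorch sketch (illustrative only, not the paper's code) of
# (1) latent-CoT distillation from a teacher and (2) a preference-guided
# trajectory-alignment objective, as described in the abstract.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch, n_latent, d_model, traj_len, act_dim = 2, 8, 64, 16, 7  # hypothetical sizes

# Hypothetical student outputs: compact latent "CoT" tokens and a predicted action trajectory.
student_latents = torch.randn(batch, n_latent, d_model, requires_grad=True)
student_traj = torch.randn(batch, traj_len, act_dim, requires_grad=True)

# Hypothetical teacher targets: embeddings of its verbose CoT, plus a preferred
# (teacher-aligned) and a rejected manipulation trajectory.
teacher_cot_emb = torch.randn(batch, n_latent, d_model)
preferred_traj = torch.randn(batch, traj_len, act_dim)
rejected_traj = torch.randn(batch, traj_len, act_dim)

# (1) Latent-CoT distillation: pull student latents toward the teacher's CoT embeddings.
distill_loss = F.mse_loss(student_latents, teacher_cot_emb)

# (2) Preference-guided alignment: penalize the student when its trajectory is
# farther from the preferred trajectory than from the rejected one.
d_pref = ((student_traj - preferred_traj) ** 2).mean(dim=(1, 2))
d_rej = ((student_traj - rejected_traj) ** 2).mean(dim=(1, 2))
pref_loss = F.softplus(d_pref - d_rej).mean()

loss = distill_loss + 0.5 * pref_loss  # 0.5 is an arbitrary illustrative weight
loss.backward()
print(f"distill={distill_loss.item():.4f}  pref={pref_loss.item():.4f}  total={loss.item():.4f}")
```

In a full system these targets would come from the teacher model and a policy head rather than random tensors; the sketch only shows how a compact latent plan can be supervised jointly with trajectory preferences in a single objective.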