ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
January 16, 2026
Authors: Linqing Zhong, Yi Liu, Yifei Wei, Ziyu Xiong, Maoqing Yao, Si Liu, Guanghui Ren
cs.AI
Abstract
Vision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings. Recent advances have introduced explicit intermediate reasoning, such as sub-task prediction (language) or goal image synthesis (vision), to guide action generation. However, this intermediate reasoning is often indirect and inherently limited in its capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm in which the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose ACoT-VLA, a novel architecture that materializes the ACoT paradigm. Specifically, we introduce two complementary components: an Explicit Action Reasoner (EAR) and an Implicit Action Reasoner (IAR). The former proposes coarse reference trajectories as explicit action-level reasoning steps, while the latter extracts latent action priors from the internal representations of the multimodal input; together they form an ACoT that conditions the downstream action head to enable grounded policy learning. Extensive experiments in real-world and simulation environments demonstrate the superiority of our proposed method, which achieves 98.5%, 84.1%, and 47.4% on LIBERO, LIBERO-Plus, and VLABench, respectively.
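As a reading aid, the sketch below illustrates one way the two reasoners and the conditioned action head described in the abstract could fit together. It is a minimal PyTorch-style illustration, not the paper's implementation: the module names (ExplicitActionReasoner, ImplicitActionReasoner, ACoTActionHead), the MLP/attention-pooling internals, and all dimensions are assumptions, and the actual VLM backbone, trajectory representation, and action-head design are not specified in the abstract.

```python
# Illustrative sketch of an ACoT-style forward pass. All names, shapes, and
# module internals below are assumptions for illustration only.
import torch
import torch.nn as nn


class ExplicitActionReasoner(nn.Module):
    """Proposes a coarse reference trajectory as an explicit action-level
    reasoning step (hypothetical implementation: a small MLP decoder)."""

    def __init__(self, embed_dim: int, horizon: int, action_dim: int):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, horizon * action_dim),
        )

    def forward(self, vlm_embedding: torch.Tensor) -> torch.Tensor:
        # (B, D) -> (B, H, A): coarse waypoints, not the final actions.
        coarse = self.decoder(vlm_embedding)
        return coarse.view(-1, self.horizon, self.action_dim)


class ImplicitActionReasoner(nn.Module):
    """Extracts a latent action prior from the VLM's internal multimodal
    representations (hypothetical: attention pooling over hidden states)."""

    def __init__(self, embed_dim: int, latent_dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(embed_dim, latent_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (B, T, D) -> (B, latent_dim): a compact latent action prior.
        q = self.query.expand(hidden_states.size(0), -1, -1)
        pooled, _ = self.attn(q, hidden_states, hidden_states)
        return self.proj(pooled.squeeze(1))


class ACoTActionHead(nn.Module):
    """Action head conditioned on the ACoT (coarse trajectory + latent prior)."""

    def __init__(self, embed_dim: int, latent_dim: int, horizon: int, action_dim: int):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        in_dim = embed_dim + latent_dim + horizon * action_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, horizon * action_dim),
        )

    def forward(self, vlm_embedding, coarse_traj, latent_prior):
        cond = torch.cat(
            [vlm_embedding, latent_prior, coarse_traj.flatten(1)], dim=-1
        )
        return self.net(cond).view(-1, self.horizon, self.action_dim)


# Usage with dummy tensors standing in for VLM outputs.
B, T, D, H, A, Z = 2, 64, 1024, 8, 7, 128
vlm_embedding = torch.randn(B, D)     # pooled multimodal embedding
hidden_states = torch.randn(B, T, D)  # VLM internal representations

ear = ExplicitActionReasoner(D, H, A)
iar = ImplicitActionReasoner(D, Z)
head = ACoTActionHead(D, Z, H, A)

coarse_traj = ear(vlm_embedding)    # explicit action-level reasoning step
latent_prior = iar(hidden_states)   # implicit latent action prior
actions = head(vlm_embedding, coarse_traj, latent_prior)  # final policy output
print(actions.shape)  # torch.Size([2, 8, 7])
```

The key design point the sketch tries to convey is that both reasoning signals live in the action space (a coarse trajectory and a latent action prior) and serve purely as conditioning for the final action head, rather than as language or image intermediates.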