ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
January 16, 2026
Authors: Linqing Zhong, Yi Liu, Yifei Wei, Ziyu Xiong, Maoqing Yao, Si Liu, Guanghui Ren
cs.AI
Abstract
Vision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings. Recent advancements have introduced explicit intermediary reasoning, such as sub-task prediction (language) or goal image synthesis (vision), to guide action generation. However, such intermediate reasoning is often indirect and inherently limited in its capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm in which the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose ACoT-VLA, a novel architecture that materializes the ACoT paradigm. Specifically, we introduce two complementary components: an Explicit Action Reasoner (EAR) and an Implicit Action Reasoner (IAR). The former proposes coarse reference trajectories as explicit action-level reasoning steps, while the latter extracts latent action priors from internal representations of the multimodal input; together they form an ACoT that conditions the downstream action head to enable grounded policy learning. Extensive experiments in real-world and simulation environments demonstrate the superiority of our proposed method, which achieves 98.5%, 84.1%, and 47.4% on LIBERO, LIBERO-Plus, and VLABench, respectively.
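To make the two-reasoner structure described above concrete, the following is a minimal PyTorch sketch of how an EAR (coarse reference trajectory), an IAR (latent action priors from internal multimodal tokens), and an ACoT-conditioned action head could be wired together. All module internals, dimensions, and layer choices (MLP, cross-attention, GRU decoder) are illustrative assumptions, not the paper's actual implementation.

```python
# Structural sketch of the EAR/IAR/action-head wiring described in the abstract.
# Module internals and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class ExplicitActionReasoner(nn.Module):
    """EAR (assumed form): maps a pooled VLM embedding to a coarse reference trajectory."""

    def __init__(self, embed_dim: int, horizon: int, action_dim: int):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 512), nn.GELU(),
            nn.Linear(512, horizon * action_dim),
        )

    def forward(self, vlm_embed: torch.Tensor) -> torch.Tensor:
        # (B, embed_dim) -> (B, horizon, action_dim) coarse action intents
        return self.net(vlm_embed).view(-1, self.horizon, self.action_dim)


class ImplicitActionReasoner(nn.Module):
    """IAR (assumed form): distills latent action priors from internal VLM tokens."""

    def __init__(self, embed_dim: int, num_latents: int = 8):
        super().__init__()
        self.latent_queries = nn.Parameter(torch.randn(num_latents, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, vlm_tokens: torch.Tensor) -> torch.Tensor:
        # Cross-attend learned queries over internal multimodal tokens:
        # (B, T, embed_dim) -> (B, num_latents, embed_dim) latent action priors.
        q = self.latent_queries.unsqueeze(0).expand(vlm_tokens.size(0), -1, -1)
        priors, _ = self.attn(q, vlm_tokens, vlm_tokens)
        return priors


class ACoTConditionedActionHead(nn.Module):
    """Action head conditioned on the ACoT (EAR trajectory + IAR priors)."""

    def __init__(self, embed_dim: int, action_dim: int):
        super().__init__()
        self.traj_proj = nn.Linear(action_dim, embed_dim)
        self.decoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, action_dim)

    def forward(self, ref_traj, priors, vlm_embed):
        # Fuse coarse trajectory tokens with pooled priors and the global embedding,
        # then decode the final fine-grained action sequence.
        traj_tokens = self.traj_proj(ref_traj)                       # (B, H, D)
        ctx = priors.mean(dim=1, keepdim=True) + vlm_embed.unsqueeze(1)
        hidden, _ = self.decoder(traj_tokens + ctx)
        return self.out(hidden)                                      # (B, H, action_dim)


if __name__ == "__main__":
    B, T, D, H, A = 2, 64, 768, 16, 7     # batch, tokens, embed dim, horizon, action dim
    vlm_tokens = torch.randn(B, T, D)     # stand-in for VLM internal representations
    vlm_embed = vlm_tokens.mean(dim=1)    # stand-in for a pooled multimodal embedding

    ear, iar = ExplicitActionReasoner(D, H, A), ImplicitActionReasoner(D)
    head = ACoTConditionedActionHead(D, A)

    actions = head(ear(vlm_embed), iar(vlm_tokens), vlm_embed)
    print(actions.shape)  # torch.Size([2, 16, 7])
```

In this sketch, the coarse EAR trajectory and the IAR latent priors play the role of the action-space "chain of thought" that conditions the final policy; how the real system fuses them (and whether the action head is autoregressive, diffusion-based, or otherwise) is not specified by the abstract.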