ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

Abstract

Vision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks. Conventionally, they translate multimodal inputs directly into actions via Vision-Language Model (VLM) embeddings. Recent work has introduced explicit intermediate reasoning, such as sub-task prediction (language) or goal image synthesis (vision), to guide action generation. However, such intermediate reasoning is often indirect and inherently limited in its capacity to convey the full, granular information required for precise action execution. We instead posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm in which the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose ACoT-VLA, a novel architecture that instantiates the ACoT paradigm. Specifically, we introduce two complementary components: an Explicit Action Reasoner (EAR) and an Implicit Action Reasoner (IAR). The former proposes coarse reference trajectories as explicit action-level reasoning steps, while the latter extracts latent action priors from internal representations of the multimodal input. Together they form an ACoT that conditions the downstream action head, enabling grounded policy learning. Extensive experiments in real-world and simulation environments demonstrate the superiority of the proposed method, which achieves 98.5%, 84.1%, and 47.4% on LIBERO, LIBERO-Plus, and VLABench, respectively.
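The data flow described above (an explicit reasoner proposing a coarse trajectory, an implicit reasoner extracting a latent prior, and an action head conditioned on both) can be illustrated with a toy sketch. This is not the authors' implementation: every dimension, function name, the mean pooling, and the random untrained linear layers below are assumptions chosen only to make the conditioning structure concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; none of these values come from the paper.
D_EMB = 32      # width of the multimodal (VLM) token features
N_TOKENS = 10   # number of VLM tokens
A_DIM = 7       # action dimension, e.g. a 7-DoF arm command
K_COARSE = 4    # coarse reference steps proposed by the explicit reasoner
H_FINE = 16     # fine-grained actions emitted by the action head
D_PRIOR = 8     # width of the latent action prior


def linear(in_dim, out_dim):
    """Return a random, untrained linear map standing in for a learned layer."""
    w = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: x @ w


# Stand-ins for the three learned modules (assumed structure).
ear_proj = linear(D_EMB, K_COARSE * A_DIM)
iar_proj = linear(D_EMB, D_PRIOR)
head_proj = linear(D_EMB + K_COARSE * A_DIM + D_PRIOR, H_FINE * A_DIM)


def explicit_action_reasoner(vlm_tokens):
    """Propose a coarse reference trajectory of shape (K_COARSE, A_DIM)."""
    pooled = vlm_tokens.mean(axis=0)
    return ear_proj(pooled).reshape(K_COARSE, A_DIM)


def implicit_action_reasoner(vlm_tokens):
    """Extract a latent action prior of shape (D_PRIOR,) from the features."""
    pooled = vlm_tokens.mean(axis=0)
    return iar_proj(pooled)


def action_head(vlm_tokens, coarse_traj, latent_prior):
    """Predict the final action chunk, conditioned on both reasoning signals."""
    pooled = vlm_tokens.mean(axis=0)
    cond = np.concatenate([pooled, coarse_traj.ravel(), latent_prior])
    return head_proj(cond).reshape(H_FINE, A_DIM)


# Random features stand in for the VLM's internal representations.
vlm_tokens = rng.normal(size=(N_TOKENS, D_EMB))
coarse = explicit_action_reasoner(vlm_tokens)
prior = implicit_action_reasoner(vlm_tokens)
actions = action_head(vlm_tokens, coarse, prior)
print(coarse.shape, prior.shape, actions.shape)
```

The point of the sketch is only that the action head sees both the explicit coarse trajectory and the implicit latent prior alongside the multimodal features; how each module is actually parameterized and trained is not specified in the abstract.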
