ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
July 22, 2025
Authors: Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang
cs.AI
Abstract
Vision-language-action (VLA) reasoning tasks require agents to interpret
multimodal instructions, perform long-horizon planning, and act adaptively in
dynamic environments. Existing approaches typically train VLA models in an
end-to-end fashion, directly mapping inputs to actions without explicit
reasoning, which hinders their ability to plan over multiple steps or adapt to
complex task variations. In this paper, we propose ThinkAct, a dual-system
framework that bridges high-level reasoning with low-level action execution via
reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate
embodied reasoning plans, guided by reinforcement learning with action-aligned visual rewards
based on goal completion and trajectory consistency. These reasoning plans are
compressed into a visual plan latent that conditions a downstream action model
for robust action execution in target environments. Extensive experiments on
embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct
enables few-shot adaptation, long-horizon planning, and self-correction
behaviors in complex embodied AI tasks.
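
To make the dual-system idea concrete, below is a minimal sketch of how a slow reasoning module and a fast action model could be wired together at inference time. This is not the authors' implementation: the class names (ReasoningMLLM, ActionModel, PlanLatent), the environment interface, the latent size, and the replanning schedule are all hypothetical placeholders chosen for illustration.

```python
# Minimal sketch of a ThinkAct-style dual-system control loop.
# Hypothetical names and interfaces; not the authors' implementation.
from dataclasses import dataclass

import numpy as np


@dataclass
class PlanLatent:
    """Compressed representation of an embodied reasoning plan."""
    z: np.ndarray  # e.g., a fixed-size vector summarizing the plan


class ReasoningMLLM:
    """Stand-in for the multimodal LLM that reasons about the task."""

    def plan(self, observation: np.ndarray, instruction: str) -> PlanLatent:
        # In the real system this would produce an embodied reasoning plan
        # and compress it into a visual plan latent; dummy vector here.
        return PlanLatent(z=np.zeros(512, dtype=np.float32))


class ActionModel:
    """Stand-in for the low-level policy conditioned on the plan latent."""

    def act(self, observation: np.ndarray, plan: PlanLatent) -> np.ndarray:
        # Map the current observation plus the plan latent to an action,
        # e.g., a 7-DoF end-effector command. Dummy zero action here.
        return np.zeros(7, dtype=np.float32)


def run_episode(env, reasoner: ReasoningMLLM, actor: ActionModel,
                instruction: str, replan_every: int = 50,
                max_steps: int = 500) -> None:
    """Asynchronous dual-system control: replan slowly, act quickly."""
    obs = env.reset()
    plan = reasoner.plan(obs, instruction)
    for t in range(max_steps):
        if t > 0 and t % replan_every == 0:
            # Slow loop: refresh the plan latent from the latest observation,
            # which is what enables long-horizon adjustment and self-correction.
            plan = reasoner.plan(obs, instruction)
        action = actor.act(obs, plan)  # fast loop, conditioned on the latent
        obs, done = env.step(action)
        if done:
            break
```

The key design choice illustrated here is the decoupling of the two loops: the reasoning module runs infrequently and communicates with the action model only through the compact plan latent, so low-level control can stay fast while high-level plans are periodically revised.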