ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
July 22, 2025
Authors: Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang
cs.AI
Abstract
Vision-language-action (VLA) reasoning tasks require agents to interpret
multimodal instructions, perform long-horizon planning, and act adaptively in
dynamic environments. Existing approaches typically train VLA models in an
end-to-end fashion, directly mapping inputs to actions without explicit
reasoning, which hinders their ability to plan over multiple steps or adapt to
complex task variations. In this paper, we propose ThinkAct, a dual-system
framework that bridges high-level reasoning with low-level action execution via
reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate
embodied reasoning plans, guided by reinforcement learning with action-aligned visual rewards
based on goal completion and trajectory consistency. These reasoning plans are
compressed into a visual plan latent that conditions a downstream action model
for robust action execution in target environments. Extensive experiments on
embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct
enables few-shot adaptation, long-horizon planning, and self-correction
behaviors in complex embodied AI tasks.
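
To make the dual-system idea concrete, below is a minimal sketch of how a slow reasoning module and a fast action model could be wired together at inference time. This is not the authors' implementation: the class names (ReasoningMLLM, ActionModel, PlanLatent), the environment interface, the latent size, and the replanning schedule are all hypothetical placeholders chosen for illustration.

```python
# Minimal sketch of a ThinkAct-style dual-system control loop.
# Hypothetical names and interfaces; not the authors' implementation.
from dataclasses import dataclass

import numpy as np


@dataclass
class PlanLatent:
    """Compressed representation of an embodied reasoning plan."""
    z: np.ndarray  # e.g., a fixed-size vector summarizing the plan


class ReasoningMLLM:
    """Stand-in for the multimodal LLM that reasons about the task."""

    def plan(self, observation: np.ndarray, instruction: str) -> PlanLatent:
        # In the real system this would produce an embodied reasoning plan
        # and compress it into a visual plan latent; dummy vector here.
        return PlanLatent(z=np.zeros(512, dtype=np.float32))


class ActionModel:
    """Stand-in for the low-level policy conditioned on the plan latent."""

    def act(self, observation: np.ndarray, plan: PlanLatent) -> np.ndarray:
        # Map the current observation plus the plan latent to an action,
        # e.g., a 7-DoF end-effector command. Dummy zero action here.
        return np.zeros(7, dtype=np.float32)


def run_episode(env, reasoner: ReasoningMLLM, actor: ActionModel,
                instruction: str, replan_every: int = 50,
                max_steps: int = 500) -> None:
    """Asynchronous dual-system control: replan slowly, act quickly."""
    obs = env.reset()
    plan = reasoner.plan(obs, instruction)
    for t in range(max_steps):
        if t > 0 and t % replan_every == 0:
            # Slow loop: refresh the plan latent from the latest observation,
            # which is what enables long-horizon adjustment and self-correction.
            plan = reasoner.plan(obs, instruction)
        action = actor.act(obs, plan)  # fast loop, conditioned on the latent
        obs, done = env.step(action)
        if done:
            break
```

The key design choice illustrated here is the decoupling of the two loops: the reasoning module runs infrequently and communicates with the action model only through the compact plan latent, so low-level control can stay fast while high-level plans are periodically revised.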