ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
July 22, 2025
Authors: Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang
cs.AI
Abstract
Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
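
To make the dual-system idea concrete, below is a minimal sketch (not the authors' implementation) of how a high-level reasoner could compress a plan into a visual plan latent that conditions a low-level action model, together with an action-aligned reward combining goal completion and trajectory consistency. All module names, dimensions, and the reward weighting are illustrative assumptions.

```python
# Minimal sketch of ThinkAct-style dual-system latent planning.
# Module names, feature dimensions, and reward weights are assumptions,
# not the paper's actual architecture or hyperparameters.

import torch
import torch.nn as nn


class ReasonerStub(nn.Module):
    """Stands in for the multimodal LLM: maps observation and instruction
    features to a compact visual plan latent."""

    def __init__(self, obs_dim=512, text_dim=512, latent_dim=64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(obs_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, obs_feat, text_feat):
        return self.encode(torch.cat([obs_feat, text_feat], dim=-1))


class ActionModel(nn.Module):
    """Low-level policy conditioned on the current observation and the
    visual plan latent produced by the reasoner."""

    def __init__(self, obs_dim=512, latent_dim=64, action_dim=7):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs_feat, plan_latent):
        return self.policy(torch.cat([obs_feat, plan_latent], dim=-1))


def action_aligned_reward(pred_traj, ref_traj, goal_reached, w_goal=1.0, w_traj=0.5):
    """Illustrative reward: a goal-completion term plus a trajectory-consistency
    term penalizing deviation from a reference visual trajectory."""
    traj_consistency = -torch.mean((pred_traj - ref_traj) ** 2)
    return w_goal * goal_reached + w_traj * traj_consistency


if __name__ == "__main__":
    obs = torch.randn(1, 512)    # visual observation features (assumed precomputed)
    text = torch.randn(1, 512)   # instruction features (assumed precomputed)
    reasoner, actor = ReasonerStub(), ActionModel()

    plan = reasoner(obs, text)   # high-level system: visual plan latent
    action = actor(obs, plan)    # low-level system: action conditioned on the plan

    reward = action_aligned_reward(
        pred_traj=torch.randn(1, 10, 2),
        ref_traj=torch.randn(1, 10, 2),
        goal_reached=torch.tensor(1.0),
    )
    print(action.shape, reward.item())
```

In this sketch the reward would score the reasoner's plans during reinforcement fine-tuning, while the action model consumes the frozen plan latent at execution time; how the two systems are trained and interleaved in practice is specified in the paper itself.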