ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
July 22, 2025
Authors: Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang
cs.AI
Abstract
Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
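
To make the dual-system idea concrete, below is a minimal sketch (not the authors' implementation) of how a high-level reasoner could compress a plan into a visual plan latent that conditions a low-level action model, together with an action-aligned reward combining goal completion and trajectory consistency. All module names, dimensions, and the reward weighting are illustrative assumptions.

```python
# Minimal sketch of ThinkAct-style dual-system latent planning.
# Module names, feature dimensions, and reward weights are assumptions,
# not the paper's actual architecture or hyperparameters.

import torch
import torch.nn as nn


class ReasonerStub(nn.Module):
    """Stands in for the multimodal LLM: maps observation and instruction
    features to a compact visual plan latent."""

    def __init__(self, obs_dim=512, text_dim=512, latent_dim=64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(obs_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, obs_feat, text_feat):
        return self.encode(torch.cat([obs_feat, text_feat], dim=-1))


class ActionModel(nn.Module):
    """Low-level policy conditioned on the current observation and the
    visual plan latent produced by the reasoner."""

    def __init__(self, obs_dim=512, latent_dim=64, action_dim=7):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs_feat, plan_latent):
        return self.policy(torch.cat([obs_feat, plan_latent], dim=-1))


def action_aligned_reward(pred_traj, ref_traj, goal_reached, w_goal=1.0, w_traj=0.5):
    """Illustrative reward: a goal-completion term plus a trajectory-consistency
    term penalizing deviation from a reference visual trajectory."""
    traj_consistency = -torch.mean((pred_traj - ref_traj) ** 2)
    return w_goal * goal_reached + w_traj * traj_consistency


if __name__ == "__main__":
    obs = torch.randn(1, 512)    # visual observation features (assumed precomputed)
    text = torch.randn(1, 512)   # instruction features (assumed precomputed)
    reasoner, actor = ReasonerStub(), ActionModel()

    plan = reasoner(obs, text)   # high-level system: visual plan latent
    action = actor(obs, plan)    # low-level system: action conditioned on the plan

    reward = action_aligned_reward(
        pred_traj=torch.randn(1, 10, 2),
        ref_traj=torch.randn(1, 10, 2),
        goal_reached=torch.tensor(1.0),
    )
    print(action.shape, reward.item())
```

In this sketch the reward would score the reasoner's plans during reinforcement fine-tuning, while the action model consumes the frozen plan latent at execution time; how the two systems are trained and interleaved in practice is specified in the paper itself.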