ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Abstract

Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models end to end, mapping inputs directly to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning and low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans, guided by action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution in target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
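To make the reward idea concrete, here is a minimal sketch of how a visual reward built from goal completion and trajectory consistency could be computed. It is not the paper's formulation: the abstract gives no formula, so the function names, the use of 2D points normalized to [0, 1], the endpoint-distance goal term, the mean pointwise-distance trajectory term, and the equal weights are all assumptions made for illustration.

```python
import math


def goal_reward(pred, ref):
    """Hypothetical goal term: 1.0 when the predicted endpoint matches the
    reference endpoint, decaying linearly to 0.0 as the distance grows."""
    return max(0.0, 1.0 - math.dist(pred[-1], ref[-1]))


def trajectory_reward(pred, ref):
    """Hypothetical consistency term: mean pointwise distance between two
    equal-length trajectories, mapped into [0, 1]."""
    if len(pred) != len(ref) or not ref:
        raise ValueError("trajectories must be non-empty and of equal length")
    mean_dist = sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(ref)
    return max(0.0, 1.0 - mean_dist)


def visual_reward(pred, ref, w_goal=0.5, w_traj=0.5):
    """Weighted sum of the two terms; the weights are assumed, not from the paper."""
    return w_goal * goal_reward(pred, ref) + w_traj * trajectory_reward(pred, ref)


# Points are (x, y) pairs assumed to be normalized to [0, 1].
reference = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
predicted = [(0.0, 0.0), (0.5, 0.5), (1.0, 0.4)]

perfect_score = visual_reward(reference, reference)
off_target_score = visual_reward(predicted, reference)
```

A trajectory that follows the reference but misses the final position is penalized by both terms, and most heavily by the goal term. A reinforcement learning loop could use such a scalar to score each sampled reasoning plan.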
