ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Abstract

Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models end to end, mapping inputs directly to actions without explicit reasoning, which limits their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning and low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans, reinforced by action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution in target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
