World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning

Abstract

Recent advances in large vision-language models (LVLMs) have shown promise for embodied task planning, yet they struggle with fundamental challenges like dependency constraints and efficiency. Existing approaches either solely optimize action selection or leverage world models during inference, overlooking the benefits of learning to model the world as a way to enhance planning capabilities. We propose Dual Preference Optimization (D^2PO), a new learning framework that jointly optimizes state prediction and action selection through preference learning, enabling LVLMs to understand environment dynamics for better planning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error. Extensive experiments on VoTa-Bench demonstrate that our D^2PO-based method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B), achieving superior task success rates with more efficient execution paths.
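The abstract does not state the training objective, so the following is only a minimal sketch of what "jointly optimizing state prediction and action selection through preference learning" could look like. It assumes each term is the standard DPO loss over a (chosen, rejected) pair and that the two terms are simply summed. The names `dpo_loss`, `dual_preference_loss`, `beta`, and `state_weight` are hypothetical and do not come from the paper.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (assumed, not from the paper).

    Each argument is the summed log-probability of a response under the
    policy or the frozen reference model.
    """
    chosen_reward = policy_chosen - ref_chosen
    rejected_reward = policy_rejected - ref_rejected
    return neg_log_sigmoid(beta * (chosen_reward - rejected_reward))


def dual_preference_loss(action_logps, state_logps, beta=0.1, state_weight=1.0):
    """Hypothetical joint objective: action-selection term plus state-prediction term.

    action_logps / state_logps are 4-tuples:
    (policy_chosen, policy_rejected, ref_chosen, ref_rejected).
    state_weight is an assumed balancing coefficient.
    """
    action_term = dpo_loss(*action_logps, beta=beta)
    state_term = dpo_loss(*state_logps, beta=beta)
    return action_term + state_weight * state_term


if __name__ == "__main__":
    # Illustrative numbers only: preferred action vs. dispreferred action,
    # and correct next-state description vs. incorrect one.
    action_pair = (-4.0, -6.0, -5.0, -5.5)
    state_pair = (-10.0, -14.0, -11.0, -12.0)
    print(dual_preference_loss(action_pair, state_pair))
```

When the policy assigns the same log-probabilities as the reference model, each term equals log 2, and each term falls as the policy shifts probability toward the preferred action or the correct next state. In this sketch that is the sense in which both abilities are trained by one objective.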

