World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
March 13, 2025
Authors: Siyin Wang, Zhaoye Fei, Qinyuan Cheng, Shiduo Zhang, Panpan Cai, Jinlan Fu, Xipeng Qiu
cs.AI
Abstract
Recent advances in large vision-language models (LVLMs) have shown promise
for embodied task planning, yet they struggle with fundamental challenges like
dependency constraints and efficiency. Existing approaches either solely
optimize action selection or leverage world models during inference,
overlooking the benefits of learning to model the world as a way to enhance
planning capabilities. We propose Dual Preference Optimization (D^2PO), a new
learning framework that jointly optimizes state prediction and action selection
through preference learning, enabling LVLMs to understand environment dynamics
for better planning. To automatically collect trajectories and stepwise
preference data without human annotation, we introduce a tree search mechanism
for extensive exploration via trial-and-error. Extensive experiments on
VoTa-Bench demonstrate that our D^2PO-based method significantly outperforms
existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and
LLaMA-3.2 (11B), achieving superior task success rates with more efficient
execution paths.
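
The abstract describes D^2PO as jointly optimizing state prediction and action selection via preference learning. Below is a minimal sketch, not taken from the paper, of how such a dual DPO-style objective could look: two standard DPO terms, one over preferred vs. dispreferred actions and one over preferred vs. dispreferred next-state predictions, combined with a weighting factor. The dictionary layout and the `alpha`/`beta` hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of a dual preference objective (not the authors' code).
import torch
import torch.nn.functional as F

def dpo_term(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss given sequence log-probabilities under the policy
    and a frozen reference model (all tensors of shape [batch])."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

def dual_preference_loss(action_logps, state_logps, alpha=0.5, beta=0.1):
    """Combine an action-selection preference term with a state-prediction
    preference term; `alpha` balances the two (assumed, for illustration)."""
    action_loss = dpo_term(action_logps["chosen"], action_logps["rejected"],
                           action_logps["ref_chosen"], action_logps["ref_rejected"], beta)
    state_loss = dpo_term(state_logps["chosen"], state_logps["rejected"],
                          state_logps["ref_chosen"], state_logps["ref_rejected"], beta)
    return alpha * action_loss + (1.0 - alpha) * state_loss
```

In this reading, the stepwise preference pairs gathered by the tree-search exploration would supply the chosen/rejected samples for both terms; how the paper actually weights or parameterizes the two losses is not specified in the abstract.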