
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

May 19, 2025
作者: Matteo Merler, Nicola Dainese, Minttu Alakuijala, Giovanni Bonetta, Pietro Ferrazzi, Yu Tian, Bernardo Magnini, Pekka Marttinen
cs.AI

Abstract

Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans compared to planning in natural language, with recent works extending this idea to visual domains using Vision-Language Models (VLMs). However, rigorous comparison between VLM-grounded symbolic approaches and methods that plan directly with a VLM has been hindered by a lack of common environments, evaluation protocols and model coverage. We introduce ViPlan, the first open-source benchmark for Visual Planning with symbolic predicates and VLMs. ViPlan features a series of increasingly challenging tasks in two domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We benchmark nine open-source VLM families across multiple sizes, along with selected closed models, evaluating both VLM-grounded symbolic planning and using the models directly to propose actions. We find symbolic planning to outperform direct VLM planning in Blocksworld, where accurate image grounding is crucial, whereas the opposite is true in the household robotics tasks, where commonsense knowledge and the ability to recover from errors are beneficial. Finally, we show that across most models and methods, there is no significant benefit to using Chain-of-Thought prompting, suggesting that current VLMs still struggle with visual reasoning.
