ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models
May 19, 2025
Authors: Matteo Merler, Nicola Dainese, Minttu Alakuijala, Giovanni Bonetta, Pietro Ferrazzi, Yu Tian, Bernardo Magnini, Pekka Marttinen
cs.AI
Abstract
Integrating Large Language Models with symbolic planners is a promising
direction for obtaining verifiable and grounded plans compared to planning in
natural language, with recent works extending this idea to visual domains using
Vision-Language Models (VLMs). However, rigorous comparison between
VLM-grounded symbolic approaches and methods that plan directly with a VLM has
been hindered by a lack of common environments, evaluation protocols and model
coverage. We introduce ViPlan, the first open-source benchmark for Visual
Planning with symbolic predicates and VLMs. ViPlan features a series of
increasingly challenging tasks in two domains: a visual variant of the classic
Blocksworld planning problem and a simulated household robotics environment. We
benchmark nine open-source VLM families across multiple sizes, along with
selected closed models, evaluating both VLM-grounded symbolic planning and
using the models directly to propose actions. We find symbolic planning to
outperform direct VLM planning in Blocksworld, where accurate image grounding
is crucial, whereas the opposite is true in the household robotics tasks, where
commonsense knowledge and the ability to recover from errors are beneficial.
Finally, we show that across most models and methods, there is no significant
benefit to using Chain-of-Thought prompting, suggesting that current VLMs still
struggle with visual reasoning.
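To make the two compared paradigms concrete, the sketch below contrasts a VLM-grounded symbolic planning loop (the VLM only evaluates symbolic predicates from an image, while a classical planner produces the plan) with direct VLM planning (the VLM itself proposes the next action). This is a minimal illustration, not ViPlan's actual API: every object and method name here (`env`, `vlm`, `planner`, `evaluate_predicates`, `propose_action`, `solve`, `step`) is a hypothetical placeholder.

```python
# Minimal sketch contrasting the two paradigms benchmarked in ViPlan.
# All objects and method names are hypothetical placeholders, not ViPlan's API.

def vlm_grounded_symbolic_planning(env, vlm, planner, goal, max_steps=50):
    """The VLM only grounds symbolic predicates from the current image;
    a classical symbolic planner (e.g., over a PDDL domain) does the planning."""
    for _ in range(max_steps):
        image = env.render()
        # The VLM answers true/false predicate queries such as on(a, b) or clear(c).
        state = vlm.evaluate_predicates(image, env.predicates)
        if goal.satisfied_by(state):
            return True
        plan = planner.solve(state, goal)
        if plan is None:
            # Grounding errors can produce an inconsistent or unsolvable state.
            return False
        env.step(plan[0])  # execute one action, then re-ground and replan
    return False


def direct_vlm_planning(env, vlm, goal, max_steps=50):
    """The VLM itself proposes the next action from the image and the goal."""
    for _ in range(max_steps):
        image = env.render()
        action = vlm.propose_action(image, goal, env.action_space)
        done = env.step(action)
        if done:
            return True
    return False
```

Under this framing, the abstract's findings correspond to which step dominates: in Blocksworld the predicate-grounding call is the bottleneck, since it demands precise image grounding, whereas in the household tasks the direct closed loop benefits from the VLM's commonsense knowledge and its ability to recover from failed actions.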