ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

Abstract

Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans compared to planning in natural language, with recent works extending this idea to visual domains using Vision-Language Models (VLMs). However, rigorous comparison between VLM-grounded symbolic approaches and methods that plan directly with a VLM has been hindered by a lack of common environments, evaluation protocols and model coverage. We introduce ViPlan, the first open-source benchmark for Visual Planning with symbolic predicates and VLMs. ViPlan features a series of increasingly challenging tasks in two domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We benchmark nine open-source VLM families across multiple sizes, along with selected closed models, evaluating both VLM-grounded symbolic planning and using the models directly to propose actions. We find symbolic planning to outperform direct VLM planning in Blocksworld, where accurate image grounding is crucial, whereas the opposite is true in the household robotics tasks, where commonsense knowledge and the ability to recover from errors are beneficial. Finally, we show that across most models and methods, there is no significant benefit to using Chain-of-Thought prompting, suggesting that current VLMs still struggle with visual reasoning.
