OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
June 3, 2025
Authors: Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
cs.AI
Abstract
Spatial reasoning is a key aspect of cognitive psychology and remains a major bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' grasp of basic spatial abilities, such as distinguishing left from right, judging near from far, and counting objects, these tasks represent only the most fundamental level of spatial reasoning. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, divided into 50 fine-grained subcategories. Through Internet data crawling and careful manual annotation, we construct more than 1,500 question-answer pairs. Extensive experiments show that both open- and closed-source VLMs, as well as existing reasoning and spatial-understanding models, exhibit significant limitations in comprehensive spatial understanding. We further analyze failure cases and propose potential directions for future research.
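
To make the benchmark's structure concrete, the following Python sketch shows one plausible shape for an OmniSpatial-style item and a per-category accuracy evaluation. The abstract does not specify a data format or evaluation protocol, so the `BenchmarkItem` schema, its field names, the `predict` callable, and the example subcategory label are all illustrative assumptions, not part of any released OmniSpatial tooling.

```python
# Minimal sketch of an OmniSpatial-style evaluation loop.
# The item schema and field names are assumptions for illustration;
# only the four major categories come from the paper's abstract.
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    image_path: str    # hypothetical field: path to the question image
    question: str      # natural-language spatial reasoning question
    choices: list[str] # multiple-choice options (assumed format)
    answer: str        # gold answer string
    category: str      # one of the four major categories
    subcategory: str   # one of the 50 fine-grained subcategories


def evaluate(items: list[BenchmarkItem], predict) -> dict[str, float]:
    """Compute per-category accuracy for a VLM.

    `predict` is any callable mapping (image_path, question, choices)
    to a chosen answer string; it stands in for a real VLM API call.
    """
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for item in items:
        pred = predict(item.image_path, item.question, item.choices)
        total[item.category] = total.get(item.category, 0) + 1
        if pred == item.answer:
            correct[item.category] = correct.get(item.category, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in total.items()}


if __name__ == "__main__":
    # Toy example; "motion prediction" is a hypothetical subcategory.
    items = [
        BenchmarkItem(
            "img_001.jpg",
            "If the car turns left, which lane will it enter?",
            ["A. inner lane", "B. outer lane"],
            "A. inner lane",
            "dynamic reasoning",
            "motion prediction",
        ),
    ]
    baseline = lambda img, q, choices: choices[0]  # always picks option A
    print(evaluate(items, baseline))  # {'dynamic reasoning': 1.0}
```

Reporting accuracy per major category, as sketched here, matches the abstract's framing: it lets one see whether a model's weakness lies in dynamic reasoning, complex spatial logic, spatial interaction, or perspective-taking, rather than in a single aggregate score.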