ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Abstract

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks that require cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints, where they must adopt another entity's spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically to evaluate multi-viewpoint spatial localization recognition. It covers five distinct task types and is supported by an automated 3D annotation pipeline that generates precise directional labels. A comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models perform reasonably well on camera-perspective tasks but lose accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances the corresponding spatial comprehension capabilities of VLMs.
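To make the egocentric/allocentric distinction concrete, the following is a minimal illustrative sketch, not the benchmark's annotation pipeline. It assumes a simplified top-down 2D scene, a viewer described only by a yaw angle, and a hypothetical helper named direction_label; none of these come from the paper. It shows how the same pair of objects can receive different directional labels depending on whose frame of reference is used.

```python
import math


def direction_label(anchor_xy, target_xy, viewer_yaw_deg):
    """Label where `target_xy` lies relative to `anchor_xy`, expressed in the
    frame of a viewer facing `viewer_yaw_deg` (degrees, counterclockwise from
    the +x axis in a top-down view). Hypothetical helper for illustration."""
    dx = target_xy[0] - anchor_xy[0]
    dy = target_xy[1] - anchor_xy[1]
    if dx == 0 and dy == 0:
        raise ValueError("anchor and target coincide; direction is undefined")

    yaw = math.radians(viewer_yaw_deg)
    # Unit vector the viewer is facing, and the unit vector to the viewer's right
    # (the forward vector rotated by -90 degrees).
    forward = (math.cos(yaw), math.sin(yaw))
    right = (math.sin(yaw), -math.cos(yaw))

    # Project the anchor-to-target offset onto the viewer's axes.
    f = dx * forward[0] + dy * forward[1]
    r = dx * right[0] + dy * right[1]

    # Pick the dominant axis to get a single coarse label.
    if abs(f) >= abs(r):
        return "front" if f > 0 else "back"
    return "right" if r > 0 else "left"


# Assumed toy scene: the camera looks along +y (yaw 90), and a person
# stands at (0, 4) facing the camera (yaw -90). A cup sits at (2, 4).
person = (0.0, 4.0)
cup = (2.0, 4.0)

camera_view = direction_label(person, cup, 90.0)   # camera's (egocentric) frame
person_view = direction_label(person, cup, -90.0)  # person's (allocentric) frame
print(camera_view, person_view)
```

In this toy scene the cup is labeled "right" in the camera's frame and "left" in the person's frame, which is the kind of perspective shift the abstract reports current VLMs handle poorly.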
