SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models

Abstract

Large-scale Vision-Language Models (LVLMs) have advanced significantly by aligning vision inputs with text, and this alignment has driven remarkable progress on computer vision tasks. There are also efforts to incorporate multi-vision sensors beyond RGB, including thermal, depth, and medical X-ray images. However, we observe that current LVLMs treat images taken from multi-vision sensors as if they were in the same RGB domain, without considering the physical characteristics of the sensors. They fail to properly convey the fundamental multi-vision sensor information in the dataset and the corresponding contextual knowledge. Consequently, the information from the actual physical environment is not correctly aligned with the text, which makes it difficult to answer complex sensor-related questions that take the physical environment into account. In this paper, we aim to establish a multi-vision Sensor Perception And Reasoning benchmarK, called SPARK, that can reduce the fundamental multi-vision sensor information gap between images and multi-vision sensors. We automatically generated 6,248 vision-language test samples to investigate multi-vision sensory perception and multi-vision sensory reasoning on physical sensor knowledge proficiency, across different formats and covering different types of sensor-related questions. We used these samples to assess ten leading LVLMs. The results show that most models display deficiencies in multi-vision sensory reasoning to varying extents. Code and data are available at https://github.com/top-yun/SPARK.

