SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
August 22, 2024
作者: Youngjoon Yu, Sangyun Chung, Byung-Kwan Lee, Yong Man Ro
cs.AI
Abstract
Large-scale Vision-Language Models (LVLMs) have made remarkable progress in
computer vision tasks by aligning the text modality with vision inputs. There are also
endeavors to incorporate multi-vision sensors beyond RGB, including thermal,
depth, and medical X-ray images. However, we observe that current LVLMs treat
images taken from multi-vision sensors as if they were all in the same RGB domain,
without considering the sensors' physical characteristics. They therefore fail to
properly convey fundamental multi-vision sensor information from the dataset and
the corresponding contextual knowledge. Consequently, alignment
between the information from the actual physical environment and the text is
not achieved correctly, making it difficult to answer complex sensor-related
questions that consider the physical environment. In this paper, we aim to
establish a multi-vision Sensor Perception And Reasoning benchmarK called SPARK
that can reduce the fundamental information gap between images and
multi-vision sensors. We automatically generated 6,248 vision-language test
samples to investigate proficiency in multi-vision sensory perception and
multi-vision sensory reasoning over physical sensor knowledge, covering
different formats and types of sensor-related questions. We
utilized these samples to assess ten leading LVLMs. The results showed that
most models displayed deficiencies in multi-vision sensory reasoning to varying
extents. Code and data are available at https://github.com/top-yun/SPARK
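To make the evaluation protocol concrete, below is a minimal Python sketch of how multiple-choice test samples like SPARK's might be loaded and scored. The file name `spark_test_samples.json`, the sample field names, and the `model_answer` callback are illustrative assumptions, not the repository's actual interface; see the linked GitHub repo for the official code.

```python
# Hypothetical sketch of a SPARK-style evaluation loop.
# File name, JSON schema, and model interface are assumptions.
import json


def load_samples(path: str) -> list[dict]:
    """Load vision-language test samples from a JSON file.

    Each sample is assumed to hold an image path, a sensor type
    (e.g. RGB, thermal, depth, X-ray), a question, a list of
    answer choices, and the ground-truth answer key.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def accuracy(samples: list[dict], model_answer) -> float:
    """Fraction of samples where the model picks the correct choice."""
    correct = 0
    for s in samples:
        pred = model_answer(s["image_path"], s["question"], s["choices"])
        correct += int(pred == s["answer"])
    return correct / len(samples)


if __name__ == "__main__":
    samples = load_samples("spark_test_samples.json")  # assumed file name

    # Placeholder model that always answers "A"; swap in a real
    # LVLM call (image + question -> choice letter) to benchmark it.
    dummy = lambda image_path, question, choices: "A"
    print(f"Accuracy: {accuracy(samples, dummy):.3f}")
```

Per-sensor accuracy (e.g. grouping samples by their sensor-type field before scoring) would expose the gap the paper highlights between RGB perception and reasoning over thermal, depth, or X-ray inputs.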