SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models

August 22, 2024
作者: Youngjoon Yu, Sangyun Chung, Byung-Kwan Lee, Yong Man Ro
cs.AI

Abstract

Large-scale Vision-Language Models (LVLMs) have made remarkable progress in computer vision tasks by aligning the text modality with vision inputs. There have also been efforts to incorporate multi-vision sensors beyond RGB, including thermal, depth, and medical X-ray images. However, we observe that current LVLMs treat images taken from multi-vision sensors as if they were in the same RGB domain, without considering the physical characteristics of these sensors. They fail to properly convey the fundamental multi-vision sensor information in the dataset and the corresponding contextual knowledge. Consequently, alignment between information from the actual physical environment and the text is not achieved correctly, making it difficult to answer complex sensor-related questions that depend on the physical environment. In this paper, we establish a multi-vision Sensor Perception And Reasoning benchmarK called SPARK that can reduce this fundamental information gap between images and multi-vision sensors. We automatically generated 6,248 vision-language test samples to investigate multi-vision sensory perception and multi-vision sensory reasoning over physical sensor knowledge, covering different formats and types of sensor-related questions. We used these samples to assess ten leading LVLMs. The results show that most models exhibit deficiencies in multi-vision sensory reasoning to varying extents. Code and data are available at https://github.com/top-yun/SPARK.
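The abstract does not describe the benchmark's actual evaluation interface, so the sketch below is only a rough illustration of how SPARK-style samples (a sensor image plus a sensor-related multiple-choice question, split into perception and reasoning tasks) might be scored against an LVLM. All names here (`SparkSample`, `model.answer`, the field layout) are hypothetical placeholders, not the API of the SPARK repository.

```python
# Hypothetical sketch of a SPARK-style evaluation loop.
# None of these names come from the SPARK repository; they only
# illustrate the perception/reasoning split described in the abstract.

from dataclasses import dataclass

@dataclass
class SparkSample:
    image_path: str     # e.g. a thermal, depth, or medical X-ray image
    sensor_type: str    # "rgb" | "thermal" | "depth" | "xray"
    task: str           # "perception" or "reasoning"
    question: str       # sensor-related multiple-choice question
    choices: list[str]  # candidate answers
    answer: str         # ground-truth choice label

def evaluate(model, samples: list[SparkSample]) -> dict[str, float]:
    """Return accuracy per task type (perception vs. reasoning)."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for s in samples:
        # `model.answer` stands in for whatever inference call the
        # evaluated LVLM exposes (image + prompt -> choice label).
        prediction = model.answer(s.image_path, s.question, s.choices)
        total[s.task] = total.get(s.task, 0) + 1
        if prediction == s.answer:
            correct[s.task] = correct.get(s.task, 0) + 1
    return {t: correct.get(t, 0) / n for t, n in total.items()}
```

Reporting accuracy separately per task type mirrors the paper's headline finding: a model can do well on multi-vision sensory perception while still lagging on multi-vision sensory reasoning.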

