Can Vision-Language Models Answer Face to Face Questions in the Real-World?
March 25, 2025
Authors: Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, Roland Memisevic
cs.AI
Abstract
AI models have made significant strides in recent years in their ability to
describe and answer questions about real-world images. They have also made
progress in the ability to converse with users in real-time using audio input.
This raises the question: have we reached the point where AI models, connected
to a camera and microphone, can converse with users in real-time about scenes
and events that are unfolding live in front of the camera? This has been a
long-standing goal in AI and is a prerequisite for real-world AI assistants and
humanoid robots to interact with humans in everyday situations. In this work,
we introduce a new dataset and benchmark, the Qualcomm Interactive Video
Dataset (IVD), which allows us to assess the extent to which existing models
can support these abilities, and to what degree these capabilities can be
instilled through fine-tuning. The dataset is based on a simple
question-answering setup, where users ask questions that the system has to
answer, in real-time, based on the camera and audio input. We show that
existing models fall far behind human performance on this task, and we identify
the main sources of the performance gap. However, we also show that for many
of the required perceptual skills, fine-tuning on this form of data can
significantly reduce this gap.
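To make the question-answering setup concrete, the sketch below shows the kind of evaluation loop the benchmark implies: each sample pairs live camera frames and a spoken question with a reference answer, and a model must produce a text answer that is then scored. All names here (IVDSample, answer_question, the exact-match metric) are hypothetical illustrations, not the paper's actual data format, model interface, or evaluation metric.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class IVDSample:
    """One benchmark item: camera frames, the spoken question, and a reference answer.

    Hypothetical structure for illustration; the actual Qualcomm IVD format may differ.
    """
    frames: List[np.ndarray]      # camera frames up to the moment the question ends
    question_audio: np.ndarray    # raw waveform of the user's spoken question
    reference_answer: str


def answer_question(sample: IVDSample) -> str:
    """Stub standing in for a vision-language model that consumes the
    camera and audio input and returns a text answer in real time."""
    return "a red mug"  # placeholder prediction


def evaluate(samples: List[IVDSample]) -> float:
    """Score a model with exact-match accuracy (an assumed metric, chosen
    only to make the sketch runnable end to end)."""
    correct = 0
    for sample in samples:
        prediction = answer_question(sample)
        correct += int(
            prediction.strip().lower() == sample.reference_answer.strip().lower()
        )
    return correct / max(len(samples), 1)
```

The same loop structure would also serve fine-tuning: replacing the stub with a trainable multimodal model and the reference answers with supervision targets corresponds to the fine-tuning setting the abstract describes.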