Can Vision-Language Models Answer Face to Face Questions in the Real-World?
March 25, 2025
Authors: Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, Roland Memisevic
cs.AI
Abstract
AI models have made significant strides in recent years in their ability to
describe and answer questions about real-world images. They have also made
progress in the ability to converse with users in real-time using audio input.
This raises the question: have we reached the point where AI models, connected
to a camera and microphone, can converse with users in real-time about scenes
and events that are unfolding live in front of the camera? This has been a
long-standing goal in AI and is a prerequisite for real-world AI assistants and
humanoid robots to interact with humans in everyday situations. In this work,
we introduce a new dataset and benchmark, the Qualcomm Interactive Video
Dataset (IVD), which allows us to assess the extent to which existing models
can support these abilities, and to what degree these capabilities can be
instilled through fine-tuning. The dataset is based on a simple
question-answering setup, where users ask questions that the system has to
answer, in real-time, based on the camera and audio input. We show that
existing models fall far behind human performance on this task, and we identify
the main sources of the performance gap. However, we also show that for many
of the required perceptual skills, fine-tuning on this form of data can
significantly reduce this gap.
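To make the question-answering setup concrete, the sketch below shows the kind of evaluation loop the benchmark implies: each sample pairs live camera frames and a spoken question with a reference answer, and a model must produce a text answer that is then scored. All names here (IVDSample, answer_question, the exact-match metric) are hypothetical illustrations, not the paper's actual data format, model interface, or evaluation metric.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class IVDSample:
    """One benchmark item: camera frames, the spoken question, and a reference answer.

    Hypothetical structure for illustration; the actual Qualcomm IVD format may differ.
    """
    frames: List[np.ndarray]      # camera frames up to the moment the question ends
    question_audio: np.ndarray    # raw waveform of the user's spoken question
    reference_answer: str


def answer_question(sample: IVDSample) -> str:
    """Stub standing in for a vision-language model that consumes the
    camera and audio input and returns a text answer in real time."""
    return "a red mug"  # placeholder prediction


def evaluate(samples: List[IVDSample]) -> float:
    """Score a model with exact-match accuracy (an assumed metric, chosen
    only to make the sketch runnable end to end)."""
    correct = 0
    for sample in samples:
        prediction = answer_question(sample)
        correct += int(
            prediction.strip().lower() == sample.reference_answer.strip().lower()
        )
    return correct / max(len(samples), 1)
```

The same loop structure would also serve fine-tuning: replacing the stub with a trainable multimodal model and the reference answers with supervision targets corresponds to the fine-tuning setting the abstract describes.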