Can Vision-Language Models Answer Face to Face Questions in the Real-World?
March 25, 2025
Authors: Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, Roland Memisevic
cs.AI
Abstract
AI models have made significant strides in recent years in their ability to
describe and answer questions about real-world images. They have also made
progress in their ability to converse with users in real time using audio input.
This raises the question: have we reached the point where AI models, connected
to a camera and microphone, can converse with users in real time about scenes
and events that are unfolding live in front of the camera? This has been a
long-standing goal in AI and is a prerequisite for real-world AI assistants and
humanoid robots to interact with humans in everyday situations. In this work,
we introduce a new dataset and benchmark, the Qualcomm Interactive Video
Dataset (IVD), which allows us to assess the extent to which existing models
can support these abilities, and to what degree these capabilities can be
instilled through fine-tuning. The dataset is based on a simple
question-answering setup, where users ask questions that the system has to
answer in real time, based on the camera and audio input. We show that
existing models fall far behind human performance on this task, and we identify
the main sources of the performance gap. However, we also show that for many
of the required perceptual skills, fine-tuning on this form of data can
significantly reduce this gap.
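
To make the question-answering setup concrete, below is a minimal sketch of what an evaluation loop over such data might look like. It is illustrative only: the names (`IVDSample`, `StreamingVLM`, `answer`, `exact_match_accuracy`) and the field layout are assumptions, not the actual Qualcomm IVD schema or any API described in the paper, and exact-match scoring stands in for whatever metric the benchmark actually uses.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class IVDSample:
    """One hypothetical IVD-style example: a question asked live on camera.

    Field names are illustrative; the real dataset schema is not given
    in the abstract.
    """
    frames: List[bytes]       # video frames surrounding the question
    audio: bytes              # audio waveform containing the spoken question
    question_text: str        # transcript of the user's question
    reference_answer: str     # ground-truth answer


class StreamingVLM(Protocol):
    """Any model mapping (frames, audio, question) to a text answer."""
    def answer(self, frames: List[bytes], audio: bytes, question: str) -> str: ...


def exact_match_accuracy(model: StreamingVLM, samples: List[IVDSample]) -> float:
    """Score a model by normalized exact match against reference answers.

    Exact match keeps the sketch self-contained; real benchmarks of this
    kind typically use softer, judged or token-level metrics.
    """
    correct = 0
    for s in samples:
        prediction = model.answer(s.frames, s.audio, s.question_text)
        correct += int(prediction.strip().lower() == s.reference_answer.strip().lower())
    return correct / max(len(samples), 1)
```

Note that a deployed real-time system would stream frames and audio continuously and decide when to respond, rather than scoring offline question-answer pairs as this sketch does.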