Can Vision-Language Models Answer Face to Face Questions in the Real-World?

Abstract

AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in the ability to converse with users in real time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Qualcomm Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources of the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce this gap.
