Proactive Assistant Dialogue Generation from Streaming Egocentric Videos

Abstract

Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in \dataset, a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. Project page: https://pro-assist.github.io/
