EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild
February 17, 2025
Authors: Junhyeok Kim, Min Soo Kim, Jiwan Chung, Jungbin Cho, Jisoo Kim, Sungwoong Kim, Gyeongbo Sim, Youngjae Yu
cs.AI
Abstract
Predicting when to initiate speech in real-world environments remains a
fundamental challenge for conversational agents. We introduce EgoSpeak, a novel
framework for real-time speech initiation prediction in egocentric streaming
video. By modeling the conversation from the speaker's first-person viewpoint,
EgoSpeak is tailored for human-like interactions in which a conversational
agent must continuously observe its environment and dynamically decide when to
talk. Our approach bridges the gap between simplified experimental setups and
complex natural conversations by integrating four key capabilities: (1)
first-person perspective, (2) RGB processing, (3) online processing, and (4)
untrimmed video processing. We also present YT-Conversation, a diverse
collection of in-the-wild conversational videos from YouTube, as a resource for
large-scale pretraining. Experiments on EasyCom and Ego4D demonstrate that
EgoSpeak outperforms random and silence-based baselines in real time. Our
results also highlight the importance of multimodal input and context length in
effectively deciding when to speak.
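
To make the task setting concrete, below is a minimal, hypothetical sketch of the online decision loop the abstract describes: a model observes an untrimmed egocentric RGB stream and, at every frame, scores whether to initiate speech. The `DummySpeechInitiationModel` class, the `stream_decisions` function, and all parameter names are illustrative assumptions, not EgoSpeak's actual interface or architecture.

```python
import collections
from typing import Deque, Iterable, Iterator, Tuple

import numpy as np


class DummySpeechInitiationModel:
    """Placeholder for a learned frame-level model (an assumption, not the
    paper's architecture). Maps a window of recent RGB frames to the
    probability that the agent should start speaking now."""

    def predict_proba(self, frames: np.ndarray) -> float:
        # Stand-in heuristic so the sketch runs end to end; a real model
        # would be trained (e.g., pretrained on conversational video).
        return float(frames.mean() / 255.0)


def stream_decisions(
    frame_iter: Iterable[np.ndarray],
    model: DummySpeechInitiationModel,
    context_len: int = 32,
    threshold: float = 0.5,
) -> Iterator[Tuple[int, float, bool]]:
    """Online loop over an untrimmed egocentric stream: keep only the most
    recent `context_len` frames and emit a speak/wait decision per frame."""
    context: Deque[np.ndarray] = collections.deque(maxlen=context_len)
    for t, frame in enumerate(frame_iter):
        context.append(frame)
        window = np.stack(context)              # (<= context_len, H, W, 3)
        p_speak = model.predict_proba(window)   # causal: no future frames used
        yield t, p_speak, p_speak >= threshold


if __name__ == "__main__":
    # Synthetic 2-second stream of 64x64 RGB frames at 30 fps.
    rng = np.random.default_rng(0)
    stream = (rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8) for _ in range(60))
    for t, p, speak in stream_decisions(stream, DummySpeechInitiationModel()):
        if speak:
            print(f"frame {t}: p={p:.2f} -> initiate speech")
```

The point mirrored here is the online, untrimmed setting: each decision depends only on past frames within a bounded context window, which is why the abstract emphasizes the role of context length (and, in the full system, multimodal input) in deciding when to speak.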