
EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild

February 17, 2025
作者: Junhyeok Kim, Min Soo Kim, Jiwan Chung, Jungbin Cho, Jisoo Kim, Sungwoong Kim, Gyeongbo Sim, Youngjae Yu
cs.AI

Abstract

Predicting when to initiate speech in real-world environments remains a fundamental challenge for conversational agents. We introduce EgoSpeak, a novel framework for real-time speech initiation prediction in egocentric streaming video. By modeling the conversation from the speaker's first-person viewpoint, EgoSpeak is tailored for human-like interactions in which a conversational agent must continuously observe its environment and dynamically decide when to talk. Our approach bridges the gap between simplified experimental setups and complex natural conversations by integrating four key capabilities: (1) first-person perspective, (2) RGB processing, (3) online processing, and (4) untrimmed video processing. We also present YT-Conversation, a diverse collection of in-the-wild conversational videos from YouTube, as a resource for large-scale pretraining. Experiments on EasyCom and Ego4D demonstrate that EgoSpeak outperforms random and silence-based baselines in real time. Our results also highlight the importance of multimodal input and context length in effectively deciding when to speak.
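To make the online, untrimmed-video setting concrete, below is a minimal sketch of a causal, frame-by-frame speech-initiation predictor over streaming multimodal features. This is not the authors' architecture: the class name `SpeechInitiationPredictor`, the feature dimensions, the GRU backbone, and the decision threshold are all illustrative assumptions; only the general idea (score "speak now" at every frame using a bounded window of past RGB and audio context) reflects the setting described in the abstract.

```python
import torch
import torch.nn as nn
from collections import deque

class SpeechInitiationPredictor(nn.Module):
    """Hypothetical online predictor: at each incoming frame it scores the
    probability that the agent should start speaking now, using only a
    bounded window of past multimodal features (causal / streaming).
    Illustrative sketch only; not the EgoSpeak architecture."""

    def __init__(self, visual_dim=512, audio_dim=128, hidden_dim=256, context_len=64):
        super().__init__()
        self.context_len = context_len
        self.proj = nn.Linear(visual_dim + audio_dim, hidden_dim)
        # Unidirectional GRU keeps the model causal: it never sees future frames.
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)
        self.buffer = deque(maxlen=context_len)  # sliding context window

    @torch.no_grad()
    def step(self, visual_feat, audio_feat):
        """Consume one frame's RGB and audio features; return P(speak now)."""
        fused = self.proj(torch.cat([visual_feat, audio_feat], dim=-1))
        self.buffer.append(fused)
        context = torch.stack(list(self.buffer), dim=0).unsqueeze(0)  # (1, T, H)
        out, _ = self.temporal(context)
        logit = self.head(out[:, -1])  # score only the most recent frame
        return torch.sigmoid(logit).item()

# Usage: in practice, per-frame features would come from pretrained
# visual and audio encoders; random tensors stand in for a video stream.
predictor = SpeechInitiationPredictor()
for _ in range(100):  # stand-in for an untrimmed, streaming video
    v, a = torch.randn(512), torch.randn(128)
    if predictor.step(v, a) > 0.5:
        print("initiate speech")
```

The bounded `deque` buffer is one simple way to expose context length as a tunable parameter, echoing the abstract's observation that both multimodal input and the amount of temporal context matter for deciding when to speak.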
