Proactive Assistant Dialogue Generation from Streaming Egocentric Videos

Abstract

Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in ProAssist, a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. Project page: https://pro-assist.github.io/
