Proactive Assistant Dialogue Generation from Streaming Egocentric Videos

Abstract

Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in \dataset, a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. Project page: https://pro-assist.github.io/
