Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Abstract

Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, because the avatar must perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and from the control-quality dilemma in GHOI generation. To address this, we propose InteractAvatar, a novel dual-stream framework that decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) that generates text-aligned interaction motions. In addition, we propose an Audio-Interaction Aware Generation Module (AIM) that synthesizes vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io
