

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

January 2, 2026
Authors: Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang
cs.AI

Abstract

Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real time under causal constraints, and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency, enabling instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500 ms), achieving a 6.8× speedup over the baseline, and produces reactive and expressive avatar motion that is preferred over the baseline in more than 80% of user evaluations.
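
The condition-dropping preference objective sketched in the abstract can be illustrated as follows. This is a minimal, hypothetical PyTorch-style sketch, not the paper's implementation: the interface (`model.sample`, `model.log_prob`, `user_cond`, `beta`) is assumed for illustration, and whether the "winning" motion is ground truth or a user-conditioned generation is likewise an assumption; the authors' actual diffusion-forcing formulation will differ in detail.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the label-free preference objective described above.
# Losing samples are synthesized by dropping the user's audio/motion conditions,
# so no human preference labels are required.

@torch.no_grad()
def build_preference_pair(model, speech, user_cond, motion_gt):
    """Winning = reference motion; losing = a generation with user conditions dropped."""
    motion_lose = model.sample(speech, user_cond=None)  # drop user audio/motion cues
    return motion_gt, motion_lose

def condition_drop_dpo_loss(model, ref_model, motion_win, motion_lose,
                            speech, user_cond, beta=0.1):
    """DPO-style objective that prefers reactive, user-conditioned avatar motion."""
    logp_win = model.log_prob(motion_win, speech, user_cond)
    logp_lose = model.log_prob(motion_lose, speech, user_cond)
    with torch.no_grad():  # frozen reference keeps the policy near its initialization
        ref_win = ref_model.log_prob(motion_win, speech, user_cond)
        ref_lose = ref_model.log_prob(motion_lose, speech, user_cond)
    # Standard DPO margin: push the model toward the user-conditioned (winning) motion.
    margin = beta * ((logp_win - ref_win) - (logp_lose - ref_lose))
    return -F.logsigmoid(margin).mean()
```

Because both branches of the pair come from the same unlabeled data, this kind of objective learns expressive, user-reactive motion without any additional annotation, which is the point the abstract emphasizes.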