

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

January 2, 2026
Authors: Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang
cs.AI

Abstract

Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real time under causal constraints, and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency, reacting instantly to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500 ms), achieving a 6.8× speedup over the baseline, and produces reactive, expressive avatar motion that is preferred in over 80% of comparisons against the baseline.
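The two ideas in the abstract are concrete enough to sketch. Below is a minimal Python/PyTorch illustration of the causal, chunk-wise generation pattern that diffusion forcing enables: each motion chunk is denoised in a few steps conditioned only on already-generated motion and the user's current audio/motion features, so latency is bounded by one chunk. All names here (MotionDenoiser, stream_avatar_motion, the feature dimensions, and the toy few-step sampler) are illustrative assumptions, not the paper's actual architecture.

```python
# Hedged sketch of causal, chunk-wise motion generation in the spirit of
# diffusion forcing. Every module and shape below is a toy stand-in.
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Toy causal denoiser: predicts a clean motion chunk from a noisy one,
    conditioned on previously generated motion and live user features."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 3 + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, noisy_chunk, past_ctx, user_feat, t):
        x = torch.cat([noisy_chunk, past_ctx, user_feat, t], dim=-1)
        return self.net(x)  # x0-prediction for the current chunk

@torch.no_grad()
def stream_avatar_motion(model, user_stream, dim=64, steps=4):
    """Chunk-by-chunk generation under a causal constraint: each chunk sees
    only past motion and the *current* user audio/motion features."""
    past_ctx = torch.zeros(1, dim)          # summary of motion generated so far
    for user_feat in user_stream:           # multimodal features arrive live
        chunk = torch.randn(1, dim)         # start the new chunk from noise
        for i in reversed(range(1, steps + 1)):  # few steps => low latency
            t = torch.full((1, 1), i / steps)
            pred_x0 = model(chunk, past_ctx, user_feat, t)
            sigma = (i - 1) / steps         # next, lower noise level
            chunk = pred_x0 + sigma * torch.randn_like(pred_x0)
        past_ctx = chunk                    # causal context for the next chunk
        yield chunk

# Toy usage: three chunks of fake user features arriving one at a time.
model = MotionDenoiser()
for motion in stream_avatar_motion(model, (torch.randn(1, 64) for _ in range(3))):
    print(motion.shape)  # torch.Size([1, 64])
```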
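The label-free preference learning admits a similar sketch. The pairing rule follows the abstract: the losing sample is generated with the user conditions dropped, so reactive (user-conditioned) motion is treated as preferred without any human annotation, and a standard DPO objective is applied against a frozen reference model. ToyMotionPolicy, preference_step, and the unit-variance Gaussian likelihood are hypothetical simplifications, not the paper's model.

```python
# Hedged sketch: DPO with synthetic losing samples built by dropping the
# user conditions. All classes and functions are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMotionPolicy(nn.Module):
    """Gaussian stand-in for the avatar generator; log_prob scores a motion
    chunk given the avatar's audio and the user's features."""
    def __init__(self, dim=64):
        super().__init__()
        self.mean = nn.Linear(dim * 2, dim)

    def log_prob(self, motion, audio_feat, user_feat):
        mu = self.mean(torch.cat([audio_feat, user_feat], dim=-1))
        return -0.5 * ((motion - mu) ** 2).sum(dim=-1)  # unit-variance Gaussian

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO objective on (winner, loser) log-likelihoods,
    regularized by a frozen reference policy."""
    margin = (logp_w - ref_w) - (logp_l - ref_l)
    return -F.logsigmoid(beta * margin).mean()

def preference_step(policy, ref_policy, audio_feat, user_feat):
    # Winner: sampled WITH user conditioning (reactive motion).
    # Loser:  sampled with user conditions dropped (non-reactive negative);
    # no preference labels are needed to construct the pair.
    null_user = torch.zeros_like(user_feat)
    with torch.no_grad():
        mu_w = policy.mean(torch.cat([audio_feat, user_feat], dim=-1))
        mu_l = policy.mean(torch.cat([audio_feat, null_user], dim=-1))
        motion_w = mu_w + torch.randn_like(mu_w)
        motion_l = mu_l + torch.randn_like(mu_l)
    # Both samples are scored under the FULL user conditioning.
    logp_w = policy.log_prob(motion_w, audio_feat, user_feat)
    logp_l = policy.log_prob(motion_l, audio_feat, user_feat)
    with torch.no_grad():
        ref_w = ref_policy.log_prob(motion_w, audio_feat, user_feat)
        ref_l = ref_policy.log_prob(motion_l, audio_feat, user_feat)
    return dpo_loss(logp_w, logp_l, ref_w, ref_l)

policy, ref_policy = ToyMotionPolicy(), ToyMotionPolicy()
loss = preference_step(policy, ref_policy, torch.randn(2, 64), torch.randn(2, 64))
loss.backward()  # gradients flow only through the trainable policy
```

The design choice worth noting is that the negative pair costs nothing to build: dropping the user conditions at sampling time yields exactly the non-reactive behavior the method wants to penalize.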