Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars
February 2, 2026
Authors: Youliang Zhang, Zhengguang Zhou, Zhentao Yu, Ziyao Huang, Teng Hu, Sen Liang, Guozhen Zhang, Ziqiao Peng, Shunkai Li, Yi Chen, Zixiang Zhou, Yuan Zhou, Qinglin Lu, Xiu Li
cs.AI
Abstract
Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io
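The parallel co-generation idea in the abstract can be illustrated with a deliberately tiny toy. This is only a sketch under stated assumptions, not the InteractAvatar implementation: `ToyStream`, `toy_aligner`, and `co_generate` are hypothetical names, the "latents" are plain lists of floats, the update rule is a simple interpolation rather than a real generative model, and audio conditioning is left out entirely. It shows only the control flow in which a motion stream (standing in for PIM) and a video stream (standing in for AIM) advance together, with an aligner passing the current motion state to the video stream at every step.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class ToyStream:
    """Stand-in for one generation stream (motion or video); not the paper's network."""
    state: Vector
    rate: float = 0.5

    def step(self, target: Vector) -> None:
        # Move the latent state part of the way toward the target (a toy refinement update).
        self.state = [s + self.rate * (t - s) for s, t in zip(self.state, target)]


def toy_aligner(motion_state: Vector, scale: float = 2.0) -> Vector:
    """Hypothetical motion-to-video aligner: maps motion latents into the video stream's space."""
    return [scale * m for m in motion_state]


def co_generate(text_plan: Vector, steps: int = 4) -> Tuple[Vector, Vector]:
    """Advance both toy streams in lockstep instead of finishing motion before video."""
    motion = ToyStream(state=[0.0] * len(text_plan))  # plays the role of PIM (assumption)
    video = ToyStream(state=[0.0] * len(text_plan))   # plays the role of AIM (assumption)
    for _ in range(steps):
        motion.step(text_plan)                 # motion follows the text-derived plan
        video.step(toy_aligner(motion.state))  # video is conditioned on the current motion
    return motion.state, video.state


if __name__ == "__main__":
    motion, video = co_generate([1.0, -1.0], steps=4)
    print("motion:", motion)
    print("video:", video)
```

The point of the lockstep loop is that the video stream never waits for a finished motion sequence: it sees the partially refined motion state at each step, which is one plausible reading of "parallel co-generation" as opposed to a two-stage cascade.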