Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Abstract

Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io
