
Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

February 2, 2026
作者: Youliang Zhang, Zhengguang Zhou, Zhentao Yu, Ziyao Huang, Teng Hu, Sen Liang, Guozhen Zhang, Ziqiao Peng, Shunkai Li, Yi Chen, Zixiang Zhou, Yuan Zhou, Qinglin Lu, Xiu Li
cs.AI

Abstract

Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io
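The dual-stream co-generation described above can be illustrated with a toy sketch. This is not the paper's implementation: the actual PIM and AIM are generative video/motion networks, and all function names, latent sizes, and the simple conditioning updates below are hypothetical stand-ins, chosen only to show the control flow in which the two streams denoise in parallel while the motion-to-video aligner injects PIM's motion features into AIM's video stream at each step.

```python
import numpy as np

rng = np.random.default_rng(0)

def perceive_objects(frame):
    # Hypothetical detector stub standing in for the detection-based
    # environmental perception: returns one (x1, y1, x2, y2) box.
    return np.array([[0.2, 0.3, 0.5, 0.8]])

def pim_step(motion_latent, text_emb, boxes):
    # Stand-in for one PIM update: fuse text and detected-object cues
    # into the motion latent (a real PIM would be a denoising network).
    cond = text_emb.mean() + boxes.mean()
    return 0.9 * motion_latent + 0.1 * cond

def motion_to_video_aligner(motion_latent):
    # Stand-in aligner: map motion features into the video latent space.
    return 0.5 * motion_latent

def aim_step(video_latent, audio_emb, aligned_motion):
    # Stand-in for one AIM update: condition the video latent on audio
    # and on the aligned motion features from the parallel PIM stream.
    cond = audio_emb.mean() + aligned_motion.mean()
    return 0.9 * video_latent + 0.1 * cond

def co_generate(text_emb, audio_emb, frame, steps=4):
    # Parallel co-generation: both latents are refined step by step,
    # with the aligner coupling the motion stream into the video stream.
    motion = rng.standard_normal(16)   # hypothetical motion latent
    video = rng.standard_normal(64)    # hypothetical video latent
    boxes = perceive_objects(frame)
    for _ in range(steps):
        motion = pim_step(motion, text_emb, boxes)
        video = aim_step(video, audio_emb, motion_to_video_aligner(motion))
    return motion, video

motion, video = co_generate(rng.standard_normal(8), rng.standard_normal(8), frame=None)
print(motion.shape, video.shape)  # (16,) (64,)
```

The point of the structure is the decoupling claimed in the abstract: perception and planning live entirely in the motion stream, so the video stream only ever sees already-aligned motion features rather than raw text or detections.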