Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Abstract

Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io
