Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Abstract

Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge: the avatar must perform text-aligned interactions with surrounding objects. The difficulty stems from the need for environmental perception and from the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. We further propose an Audio-Interaction Aware Generation Module (AIM) to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io