HOComp: Interaction-Aware Human-Object Composition

Abstract

Existing image-guided composition methods can insert a foreground object into a user-specified region of a background image, blending it naturally within that region while leaving the rest of the image unchanged. We observe, however, that these methods often struggle to synthesize seamless interaction-aware compositions when the task involves human-object interactions. In this paper, we first propose HOComp, a novel approach for compositing a foreground object onto a human-centric background image while ensuring harmonious interactions between the foreground object and the background person, as well as consistent appearances. Our approach includes two key designs: (1) MLLMs-driven Region-based Pose Guidance (MRPG), which uses MLLMs to identify the interaction region and the interaction type (e.g., holding or lifting) to provide coarse-to-fine constraints on the generated pose for the interaction, while incorporating human pose landmarks to track action variations and enforce fine-grained pose constraints; and (2) Detail-Consistent Appearance Preservation (DCAP), which unifies a shape-aware attention modulation mechanism, a multi-view appearance loss, and a background consistency loss to ensure consistent shapes and textures of the foreground and faithful reproduction of the background human. We then propose the first dataset for this task, named Interaction-aware Human-Object Composition (IHOC). Experimental results on our dataset show that HOComp effectively generates harmonious human-object interactions with consistent appearances, and outperforms relevant methods both qualitatively and quantitatively.
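The abstract names a background consistency loss but does not give its form. As a rough illustration only, one common way to express "keep everything outside the edited region unchanged" is a masked L1 penalty. The function name `background_consistency_loss`, the mask convention, and the L1 choice below are all assumptions made for this sketch, not the formulation used by HOComp.

```python
import numpy as np


def background_consistency_loss(generated, background, interaction_mask):
    """Hypothetical masked L1 loss (an assumption, not the paper's definition).

    generated, background: float arrays of shape (H, W, C).
    interaction_mask: array of shape (H, W); 1 where the image is allowed
    to change (the interaction region), 0 where it should stay the same.
    Returns the mean absolute difference over the pixels outside the mask.
    """
    keep = 1.0 - interaction_mask.astype(np.float64)  # 1 = must be preserved
    diff = np.abs(generated.astype(np.float64) - background.astype(np.float64))
    per_pixel = diff.mean(axis=-1)  # average over channels -> (H, W)
    denom = keep.sum()
    if denom == 0:
        # Every pixel is inside the interaction region; nothing to preserve.
        return 0.0
    return float((per_pixel * keep).sum() / denom)
```

Under this assumed formulation, changes inside the interaction region are free, while any deviation elsewhere is penalized in proportion to its magnitude.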
