HOComp: Interaction-Aware Human-Object Composition
July 22, 2025
Authors: Dong Liang, Jinyuan Jia, Yuhao Liu, Rynson W. H. Lau
cs.AI
Abstract
While existing image-guided composition methods may help insert a foreground
object onto a user-specified region of a background image, achieving natural
blending inside the region while keeping the rest of the image unchanged, we observe
that these methods often struggle to synthesize seamless
interaction-aware compositions when the task involves human-object
interactions. In this paper, we first propose HOComp, a novel approach for
compositing a foreground object onto a human-centric background image, while
ensuring harmonious interactions between the foreground object and the
background person, as well as their consistent appearances. Our approach includes two
key designs: (1) MLLMs-driven Region-based Pose Guidance (MRPG), which utilizes
MLLMs to identify the interaction region as well as the interaction type (e.g.,
holding and lifting) to provide coarse-to-fine constraints on the generated
pose for the interaction, while incorporating human pose landmarks to track
action variations and enforce fine-grained pose constraints; and (2)
Detail-Consistent Appearance Preservation (DCAP), which unifies a shape-aware
attention modulation mechanism, a multi-view appearance loss, and a background
consistency loss to ensure consistent shapes/textures of the foreground and
faithful reproduction of the background human. We then propose the first
dataset, named Interaction-aware Human-Object Composition (IHOC), for the task.
Experimental results on our dataset show that HOComp effectively generates
harmonious human-object interactions with consistent appearances, and
outperforms relevant methods qualitatively and quantitatively.
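
The abstract names the DCAP loss terms but gives no formulas, so the following is a minimal, purely illustrative sketch (in PyTorch) of one way a multi-view appearance loss and a background consistency loss could be combined; the weights, tensor shapes, and helper names are assumptions for exposition, not the paper's implementation.

```python
# Illustrative sketch only: the loss weights, shapes, and structure below are
# assumptions made for exposition, not HOComp's actual training objective.
import torch
import torch.nn.functional as F

def dcap_loss(fg_features, fg_reference_views, bg_pred, bg_gt,
              lambda_app: float = 1.0, lambda_bg: float = 1.0) -> torch.Tensor:
    """Hypothetical combination of a multi-view appearance loss and a
    background consistency loss, as named in the abstract."""
    # Multi-view appearance term: keep the composited foreground's features
    # close to features of the object rendered from several reference views.
    app_loss = torch.stack(
        [F.mse_loss(fg_features, view) for view in fg_reference_views]
    ).mean()

    # Background consistency term: penalize changes to the background person
    # and scene outside the interaction region.
    bg_loss = F.l1_loss(bg_pred, bg_gt)

    return lambda_app * app_loss + lambda_bg * bg_loss

# Toy usage with random tensors, only to show the expected call pattern.
fg = torch.randn(1, 256, 32, 32)
views = [torch.randn(1, 256, 32, 32) for _ in range(3)]
bg_pred = torch.randn(1, 3, 512, 512)
bg_gt = torch.randn(1, 3, 512, 512)
loss = dcap_loss(fg, views, bg_pred, bg_gt)
```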