HOComp: Interaction-Aware Human-Object Composition
July 22, 2025
Authors: Dong Liang, Jinyuan Jia, Yuhao Liu, Rynson W. H. Lau
cs.AI
Abstract
While existing image-guided composition methods can insert a foreground
object into a user-specified region of a background image, blending it
naturally within the region while leaving the rest of the image unchanged, we
observe that these methods often struggle to synthesize seamless,
interaction-aware compositions when the task involves human-object
interactions. In this paper, we first propose HOComp, a novel approach for
compositing a foreground object onto a human-centric background image, while
ensuring harmonious interactions between the foreground object and the
background person and their consistent appearances. Our approach includes two
key designs: (1) MLLMs-driven Region-based Pose Guidance (MRPG), which utilizes
MLLMs to identify the interaction region as well as the interaction type (e.g.,
holding, lifting) to provide coarse-to-fine constraints on the generated
pose for the interaction, while incorporating human pose landmarks to track
action variations and enforce fine-grained pose constraints; and (2)
Detail-Consistent Appearance Preservation (DCAP), which unifies a shape-aware
attention modulation mechanism, a multi-view appearance loss, and a background
consistency loss to ensure consistent shapes/textures of the foreground and
faithful reproduction of the background human. We then propose the first
dataset, named Interaction-aware Human-Object Composition (IHOC), for the task.
Experimental results on our dataset show that HOComp effectively generates
harmonious human-object interactions with consistent appearances, and
outperforms relevant methods both qualitatively and quantitatively.
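
As a concrete reading of the two DCAP loss terms named in the abstract, the sketch below implements a multi-view appearance loss and a background consistency loss in PyTorch. The function names, tensor shapes, the choice of an L1 distance, and the `feat_extractor` interface are our assumptions for illustration; the paper's actual formulation may differ.

```python
# A minimal sketch of the two DCAP loss terms, under assumed shapes and metrics.
import torch
import torch.nn.functional as F


def multi_view_appearance_loss(gen_views, ref_views, feat_extractor):
    # Compare deep features of the composited foreground object against
    # reference renderings of the object from several viewpoints, so its
    # shape/texture stays consistent under the generated interaction pose.
    loss = 0.0
    for gen, ref in zip(gen_views, ref_views):  # each: (B, 3, H, W)
        loss = loss + F.l1_loss(feat_extractor(gen), feat_extractor(ref))
    return loss / max(len(gen_views), 1)


def background_consistency_loss(generated, background, interaction_mask):
    # Penalize any change to pixels outside the interaction region
    # (interaction_mask is 1 inside the region, 0 elsewhere), so the
    # background person and scene are reproduced faithfully.
    keep = 1.0 - interaction_mask  # (B, 1, H, W)
    diff = (generated - background).abs()
    return (keep * diff).sum() / keep.sum().clamp(min=1.0)
```

In the full method these losses are unified with a shape-aware attention modulation mechanism; the abstract does not specify how the terms are weighted, so any particular combination here would be a guess.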