Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

Abstract

Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, which provides no explicit control over the object's 3D pose and limits their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance that captures visual details from the reference object, geometry guidance derived from a user-adjusted 3D proxy, and context guidance from the target background. By injecting these conditions through separate pathways, DIRECT avoids feature entanglement while preserving the reference appearance, following the user-specified pose, and adapting the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of the training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.