Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

Abstract

Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object's 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.