CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models
October 30, 2023
Authors: Ziyang Yuan, Mingdeng Cao, Xintao Wang, Zhongang Qi, Chun Yuan, Ying Shan
cs.AI
Abstract
Incorporating a customized object into generated images is an attractive capability for text-to-image models. However, existing optimization-based and encoder-based methods are hindered by drawbacks such as time-consuming optimization, insufficient identity preservation, and a prevalent copy-pasting effect. To overcome these limitations, we introduce CustomNet, a novel object customization approach that explicitly incorporates 3D novel view synthesis capabilities into the object customization process. This integration facilitates the adjustment of spatial position relationships and viewpoints, yielding diverse outputs while effectively preserving object identity. Moreover, we introduce dedicated designs that enable location control and flexible background control through textual descriptions or specific user-defined images, overcoming the limitations of existing 3D novel view synthesis methods. We further leverage a dataset construction pipeline that better handles real-world objects and complex backgrounds. Equipped with these designs, our method achieves zero-shot object customization without test-time optimization, offering simultaneous control over viewpoint, location, and background. As a result, CustomNet ensures enhanced identity preservation and generates diverse, harmonious outputs.
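
To make the control interface described above concrete, here is a minimal sketch of how the four conditioning signals (object identity, viewpoint, location, background) could be packed into tokens for a diffusion backbone's cross-attention. This is not the authors' released code; all module names, feature dimensions, and pose/box parameterizations below are illustrative assumptions.

```python
# Hypothetical sketch of CustomNet-style conditioning: an object image,
# a target viewpoint, a target location, and a background (text or image)
# are encoded into one token sequence for the denoising U-Net.
import torch
import torch.nn as nn


class CustomNetConditioner(nn.Module):
    """Packs the four controls into a single conditioning tensor (assumed design)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.object_encoder = nn.Linear(512, dim)   # stand-in for a frozen image encoder
        self.pose_embed = nn.Linear(4, dim)         # e.g. (elevation, azimuth, distance, roll)
        self.box_embed = nn.Linear(4, dim)          # normalized (x0, y0, x1, y1) location
        self.background_proj = nn.Linear(768, dim)  # text tokens or an encoded bg image

    def forward(self, object_feat, pose, box, background_feat):
        tokens = torch.stack(
            [
                self.object_encoder(object_feat),  # identity-preserving object token
                self.pose_embed(pose),             # viewpoint-control token
                self.box_embed(box),               # location-control token
            ],
            dim=1,
        )
        # Background tokens (from a caption or a user-supplied image) are
        # concatenated so the denoiser can harmonize object and scene.
        return torch.cat([tokens, self.background_proj(background_feat)], dim=1)


# Usage: one zero-shot customization request (no test-time optimization).
cond = CustomNetConditioner()
c = cond(
    object_feat=torch.randn(1, 512),            # encoded reference object
    pose=torch.tensor([[0.2, 1.1, 1.5, 0.0]]),  # target viewpoint
    box=torch.tensor([[0.3, 0.4, 0.7, 0.9]]),   # where to place the object
    background_feat=torch.randn(1, 77, 768),    # e.g. text-encoder tokens
)
print(c.shape)  # torch.Size([1, 80, 768]) -> consumed via cross-attention
```

The point of the sketch is the interface, not the internals: because viewpoint, location, and background enter as separate conditioning tokens, each can be varied independently at inference time, which is what enables the simultaneous control the abstract claims.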