Generating Multi-Image Synthetic Data for Text-to-Image Customization

February 3, 2025
Authors: Nupur Kumari, Xi Yin, Jun-Yan Zhu, Ishan Misra, Samaneh Azadi
cs.AI

Abstract

Customization of text-to-image models enables users to insert custom concepts and generate the concepts in unseen settings. Existing methods either rely on costly test-time optimization or train encoders on single-image training datasets without multi-image supervision, leading to worse image quality. We propose a simple approach that addresses both limitations. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. We then propose a new encoder architecture based on shared attention mechanisms that better incorporate fine-grained visual details from input images. Finally, we propose a new inference technique that mitigates overexposure issues during inference by normalizing the text and image guidance vectors. Through extensive experiments, we show that our model, trained on the synthetic dataset with the proposed encoder and inference algorithm, outperforms existing tuning-free methods on standard customization benchmarks.
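To make the shared-attention idea concrete, here is a minimal single-head PyTorch sketch (an illustrative assumption, not the paper's exact encoder design): the target image's queries attend over keys and values drawn from both the target and the reference image tokens, so fine-grained visual details from the reference can flow into the generation.

```python
import torch
import torch.nn.functional as F

def shared_self_attention(
    x_target: torch.Tensor,     # (B, N_t, D) tokens of the image being generated
    x_reference: torch.Tensor,  # (B, N_r, D) tokens of the reference (input) image
    w_q: torch.Tensor,          # (D, D) query projection
    w_k: torch.Tensor,          # (D, D) key projection
    w_v: torch.Tensor,          # (D, D) value projection
) -> torch.Tensor:
    """Self-attention shared across two image streams: queries come from
    the target only, while keys/values span target + reference tokens."""
    q = x_target @ w_q
    # Concatenate reference tokens along the sequence axis so the target
    # can attend to fine-grained details of the reference image.
    kv_input = torch.cat([x_target, x_reference], dim=1)
    k = kv_input @ w_k
    v = kv_input @ w_v
    return F.scaled_dot_product_attention(q, k, v)  # (B, N_t, D)
```

A real encoder would use multi-head attention with per-layer projections; the point here is only that conditioning enters through extended keys/values rather than a separate cross-attention pathway.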
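The overexposure fix can be sketched in the same spirit. Below is one plausible way to normalize a combined classifier-free guidance update with separate image and text guidance vectors; the exact rescaling rule, and the scales `w_image` and `w_text`, are illustrative assumptions rather than the paper's published formulation.

```python
import torch

def normalized_guidance(
    eps_uncond: torch.Tensor,  # denoiser output with no conditioning
    eps_image: torch.Tensor,   # denoiser output with image conditioning only
    eps_full: torch.Tensor,    # denoiser output with image + text conditioning
    w_image: float = 3.0,      # illustrative image guidance scale
    w_text: float = 7.5,       # illustrative text guidance scale
) -> torch.Tensor:
    # Guidance vectors: how each condition shifts the noise prediction.
    g_image = eps_image - eps_uncond
    g_text = eps_full - eps_image

    eps = eps_uncond + w_image * g_image + w_text * g_text

    # Rescale the guided prediction per sample so its norm matches the
    # fully conditioned prediction; large guidance scales otherwise blow
    # up the norm and produce overexposed, saturated samples.
    dims = tuple(range(1, eps.dim()))
    scale = eps_full.norm(dim=dims, keepdim=True) / eps.norm(dim=dims, keepdim=True)
    return eps * scale
```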
