Generating Compositional Scenes via Text-to-Image RGBA Instance Generation

Abstract

Text-to-image diffusion generative models can generate high-quality images, but only at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations, but generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space, and scene manipulation. In this work, we propose a novel multi-stage generation paradigm designed for fine-grained control, flexibility, and interactivity. To ensure control over instance attributes, we devise a novel training paradigm that adapts a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles the components into realistic scenes. Our experiments show that our RGBA diffusion model is capable of generating diverse, high-quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach makes it possible to build and manipulate images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.
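To make the role of transparency concrete, the following is a minimal sketch of classical back-to-front alpha compositing (the standard "over" operator) applied to pre-generated RGBA instances. It is not the multi-layer composite generation process described in the abstract, which is diffusion-based; it only illustrates why instances with an alpha channel can be layered, reordered, and repositioned independently. The function names `over` and `composite` and the nested-list image layout are assumptions made for this illustration.

```python
def over(src_rgba, dst_rgb):
    """Blend one RGBA pixel over an opaque RGB pixel (all values in [0, 1])."""
    r, g, b, a = src_rgba
    return tuple(a * s + (1.0 - a) * d for s, d in zip((r, g, b), dst_rgb))


def composite(background, layers):
    """Paste RGBA layers onto an opaque RGB background, farthest layer first.

    background: list of rows, each row a list of (r, g, b) tuples.
    layers: list of (rgba_rows, (top, left)) pairs, ordered back to front.
    Layer pixels that fall outside the canvas are skipped.
    """
    canvas = [list(row) for row in background]  # copy so the input is untouched
    height, width = len(canvas), len(canvas[0])
    for rgba_rows, (top, left) in layers:
        for dy, row in enumerate(rgba_rows):
            for dx, pixel in enumerate(row):
                y, x = top + dy, left + dx
                if 0 <= y < height and 0 <= x < width:
                    canvas[y][x] = over(pixel, canvas[y][x])
    return canvas


# Hypothetical toy scene: a 4x4 black background and two 2x2 instances.
background = [[(0.0, 0.0, 0.0)] * 4 for _ in range(4)]
red = [[(1.0, 0.0, 0.0, 1.0)] * 2 for _ in range(2)]   # opaque instance
blue = [[(0.0, 0.0, 1.0, 0.5)] * 2 for _ in range(2)]  # half-transparent instance

# Red is placed behind blue; the two instances overlap at pixel (1, 1).
scene = composite(background, [(red, (0, 0)), (blue, (1, 1))])
```

Changing the order of the `layers` list changes which instance occludes the other, and changing a `(top, left)` offset moves an instance without regenerating it. This is the kind of layout edit that becomes possible once instances exist as separate RGBA images.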