DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

Abstract

Image-conditioned generation methods, such as depth- and Canny-edge-conditioned approaches, have demonstrated remarkable abilities for precise image synthesis. However, existing models still struggle to accurately control the content of multiple instances (or regions). Even state-of-the-art models like FLUX and 3DIS face challenges such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. DreamRenderer enables users to control the content of each instance via bounding boxes or masks while ensuring overall visual harmony. We propose two key innovations: 1) Bridge Image Tokens for Hard Text Attribute Binding, which uses replicated image tokens as bridge tokens to ensure that T5 text embeddings, pre-trained solely on text data, bind the correct visual attributes for each instance during Joint Attention; 2) Hard Image Attribute Binding applied only to vital layers. Through our analysis of FLUX, we identify the critical layers responsible for instance attribute rendering and apply Hard Image Attribute Binding only in these layers, using soft binding in the others. This approach ensures precise control while preserving image quality. Evaluations on the COCO-POS and COCO-MIG benchmarks demonstrate that DreamRenderer improves the Image Success Ratio by 17.7% over FLUX and enhances the performance of layout-to-image models like GLIGEN and 3DIS by up to 26.8%. Project Page: https://limuloo.github.io/DreamRenderer/.
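
The distinction between hard and soft image attribute binding can be illustrated with a small attention-mask sketch. This is not the authors' implementation. It assumes that hard binding means image tokens inside an instance's bounding box may attend only to image tokens in the same box, and that soft binding leaves attention unrestricted. The function names, the normalized (x0, y0, x1, y1) box format, the token grid size, and the set of "vital" layer indices are all hypothetical and chosen only for illustration; text tokens and bridge tokens are omitted.

```python
import numpy as np


def box_to_token_mask(box, grid_h, grid_w):
    """Flat boolean vector marking grid cells whose centers fall inside a
    normalized (x0, y0, x1, y1) box. Hypothetical helper for illustration."""
    x0, y0, x1, y1 = box
    ys = (np.arange(grid_h) + 0.5) / grid_h  # normalized row centers
    xs = (np.arange(grid_w) + 0.5) / grid_w  # normalized column centers
    inside_y = (ys >= y0) & (ys < y1)
    inside_x = (xs >= x0) & (xs < x1)
    return np.outer(inside_y, inside_x).reshape(-1)


def image_attention_mask(boxes, grid_h, grid_w, hard):
    """Boolean (query, key) matrix over image tokens; True means attention
    is allowed. Soft binding (hard=False) allows everything."""
    n = grid_h * grid_w
    allowed = np.ones((n, n), dtype=bool)
    if not hard:
        return allowed
    for box in boxes:
        region = box_to_token_mask(box, grid_h, grid_w)
        # Queries inside this instance may attend only to keys inside it.
        # If boxes overlap, the later box overwrites the shared rows.
        allowed[region] = region
    return allowed


def masks_per_layer(boxes, grid_h, grid_w, num_layers, vital_layers):
    """Hard mask in the assumed vital layers, soft mask everywhere else."""
    return [
        image_attention_mask(boxes, grid_h, grid_w, hard=(i in vital_layers))
        for i in range(num_layers)
    ]


# Two non-overlapping instances on a 4x4 token grid (illustrative values).
boxes = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
layer_masks = masks_per_layer(boxes, 4, 4, num_layers=6, vital_layers={2, 3})
```

In this sketch, background tokens outside every box keep full attention even in hard layers. Restricting the mask to a few layers reflects the trade-off the abstract describes: isolation where attributes are rendered, global attention elsewhere to keep the image coherent.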
