Compass Control: Multi Object Orientation Control for Text-to-Image Generation

Abstract

Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow for explicit 3D object-centric control, such as precise control of object orientation. In this work, we address the problem of multi-object orientation control in text-to-image diffusion models. This enables the generation of diverse multi-object scenes with precise orientation control for each object. The key idea is to condition the diffusion model on a set of orientation-aware compass tokens, one for each object, along with the text tokens. A lightweight encoder network predicts these compass tokens, taking object orientation as input. The model is trained on a synthetic dataset of procedurally generated scenes, each containing one or two 3D assets on a plain background. However, training this framework directly results in poor orientation control and leads to entanglement among objects. To mitigate this, we intervene in the generation process and constrain the cross-attention maps of each compass token to its corresponding object regions. The trained model achieves precise orientation control for a) complex objects not seen during training and b) multi-object scenes with more than two objects, indicating strong generalization capabilities. Further, when combined with personalization methods, our method precisely controls the orientation of the new object in diverse contexts. Our method achieves state-of-the-art orientation control and text alignment, as quantified by extensive evaluations and a user study.
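The abstract does not spell out how orientations are encoded or how the cross-attention maps are constrained, so the following is only a minimal sketch under stated assumptions. It assumes that an orientation is a single yaw angle turned into sin/cos features before being passed to an encoder, that each object region is given as a bounding box, and that the constraint is a penalty on the attention mass falling outside that box. The names `orientation_features`, `box_mask`, and `outside_attention_fraction` are hypothetical and do not come from the paper.

```python
import math


def orientation_features(angle_deg, num_freqs=4):
    """Encode a yaw angle as sin/cos features.

    Assumption: a periodic encoding like this could serve as the input
    to a small encoder that predicts a compass token.
    """
    theta = math.radians(angle_deg)
    feats = []
    for k in range(1, num_freqs + 1):
        feats.append(math.sin(k * theta))
        feats.append(math.cos(k * theta))
    return feats


def box_mask(height, width, box):
    """Binary mask that is 1.0 inside box = (top, left, bottom, right).

    The bottom and right bounds are exclusive.
    """
    top, left, bottom, right = box
    return [
        [1.0 if top <= r < bottom and left <= c < right else 0.0
         for c in range(width)]
        for r in range(height)
    ]


def outside_attention_fraction(attn, mask):
    """Fraction of one token's attention mass that lies outside its mask.

    Assumption: minimizing this value is one simple way to keep a compass
    token's cross-attention map inside its object region.
    """
    total = sum(sum(row) for row in attn)
    if total == 0.0:
        return 0.0
    inside = sum(
        a * m
        for attn_row, mask_row in zip(attn, mask)
        for a, m in zip(attn_row, mask_row)
    )
    return 1.0 - inside / total


# Toy 4x4 cross-attention map for one compass token (unnormalized).
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 3.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
mask = box_mask(4, 4, (1, 1, 3, 3))  # object occupies rows 1-2, cols 1-2
penalty = outside_attention_fraction(attn, mask)
features = orientation_features(90.0)
```

In this toy case one unit of attention out of eight falls outside the box, so the penalty is 0.125. In an actual training setup the same quantity would be computed on differentiable tensors and added to the diffusion loss with some weight, which is again an assumption rather than something the abstract states.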
