DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation

Abstract

This paper presents a novel method for exerting fine-grained lighting control during text-driven, diffusion-based image generation. Existing diffusion models can already generate images under any lighting condition, but without additional guidance they tend to correlate image content with lighting. Moreover, text prompts lack the expressive power needed to describe detailed lighting setups. To give the content creator fine-grained control over the lighting during image generation, we augment the text prompt with detailed lighting information in the form of radiance hints, i.e., visualizations of the scene geometry with a homogeneous canonical material under the target lighting. However, the scene geometry needed to produce the radiance hints is unknown. Our key observation is that we only need to guide the diffusion process, so exact radiance hints are not necessary; it suffices to point the diffusion model in the right direction. Based on this observation, we introduce a three-stage method for controlling the lighting during image generation. In the first stage, we leverage a standard pretrained diffusion model to generate a provisional image under uncontrolled lighting. In the second stage, we resynthesize and refine the foreground object in the generated image by passing the target lighting to a refined diffusion model, named DiLightNet, using radiance hints computed on a coarse shape of the foreground object inferred from the provisional image. To retain the texture details, we multiply the radiance hints with a neural encoding of the provisional synthesized image before passing them to DiLightNet. Finally, in the third stage, we resynthesize the background to be consistent with the lighting on the foreground object. We demonstrate and validate our lighting-controlled diffusion model on a variety of text prompts and lighting conditions.
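The data flow of the three stages can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the names `relight`, `multiply_elementwise`, and every stage callable are hypothetical, images are toy single-channel nested lists, and the target lighting is reduced to a single number. The abstract does not specify the real models, the shape estimator, the hint renderer, or the encoder, so they are passed in as placeholders.

```python
from typing import Callable, List

Image = List[List[float]]  # toy single-channel image (assumption for illustration)


def multiply_elementwise(hint: Image, encoding: Image) -> Image:
    """Modulate a radiance hint by an encoding of the provisional image."""
    same_shape = len(hint) == len(encoding) and all(
        len(h) == len(e) for h, e in zip(hint, encoding)
    )
    if not same_shape:
        raise ValueError("hint and encoding must have the same shape")
    return [[h * e for h, e in zip(h_row, e_row)]
            for h_row, e_row in zip(hint, encoding)]


def relight(
    prompt: str,
    lighting: float,
    generate: Callable[[str], Image],                   # stage 1: pretrained diffusion model
    render_hint: Callable[[Image, float], Image],       # coarse shape + target lighting -> hint
    encode: Callable[[Image], Image],                   # encoding of the provisional image
    refine: Callable[[str, Image], Image],              # stage 2: lighting-conditioned model
    fill_background: Callable[[Image, float], Image],   # stage 3: consistent background
) -> Image:
    provisional = generate(prompt)                      # uncontrolled lighting
    hint = render_hint(provisional, lighting)           # approximate radiance hint
    conditioned = multiply_elementwise(hint, encode(provisional))
    foreground = refine(prompt, conditioned)            # relit foreground object
    return fill_background(foreground, lighting)


# Stub stages, used only to exercise the data flow.
result = relight(
    "a ceramic teapot",
    0.5,
    generate=lambda p: [[1.0, 2.0], [3.0, 4.0]],
    render_hint=lambda img, light: [[light] * len(row) for row in img],
    encode=lambda img: img,
    refine=lambda p, cond: cond,
    fill_background=lambda fg, light: fg,
)
print(result)  # [[0.5, 1.0], [1.5, 2.0]]
```

Passing the stages as callables keeps the sketch independent of any particular diffusion library. The only concrete operation taken directly from the abstract is the multiplication of the radiance hints with the encoding of the provisional image.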
