
DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation

February 19, 2024
Authors: Chong Zeng, Yue Dong, Pieter Peers, Youkang Kong, Hongzhi Wu, Xin Tong
cs.AI

Abstract

This paper presents a novel method for exerting fine-grained lighting control during text-driven diffusion-based image generation. While existing diffusion models already have the ability to generate images under any lighting condition, without additional guidance these models tend to correlate image content and lighting. Moreover, text prompts lack the expressive power needed to describe detailed lighting setups. To provide the content creator with fine-grained control over the lighting during image generation, we augment the text prompt with detailed lighting information in the form of radiance hints, i.e., visualizations of the scene geometry with a homogeneous canonical material under the target lighting. However, the scene geometry needed to produce the radiance hints is unknown. Our key observation is that we only need to guide the diffusion process, hence exact radiance hints are not necessary; we only need to point the diffusion model in the right direction. Based on this observation, we introduce a three-stage method for controlling the lighting during image generation. In the first stage, we leverage a standard pretrained diffusion model to generate a provisional image under uncontrolled lighting. Next, in the second stage, we resynthesize and refine the foreground object in the generated image by passing the target lighting to a refined diffusion model, named DiLightNet, using radiance hints computed on a coarse shape of the foreground object inferred from the provisional image. To retain the texture details, we multiply the radiance hints with a neural encoding of the provisional synthesized image before passing the result to DiLightNet. Finally, in the third stage, we resynthesize the background to be consistent with the lighting on the foreground object. We demonstrate and validate our lighting-controlled diffusion model on a variety of text prompts and lighting conditions.
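Read operationally, the abstract describes a three-stage pipeline. The sketch below restates those stages as Python pseudocode; every callable it accepts (generate_provisional, estimate_coarse_shape, render_radiance_hints, encode_provisional, dilightnet, resynthesize_background) is a hypothetical placeholder for a component named in the abstract, not the authors' released interface, and the signatures are assumptions made only to show how the stages compose.

```python
from typing import Callable

def lighting_controlled_generation(
    text_prompt: str,
    target_lighting,                      # e.g. an environment map describing the desired lighting (assumed representation)
    generate_provisional: Callable,       # stage 1: standard pretrained text-to-image diffusion model
    estimate_coarse_shape: Callable,      # coarse foreground shape inferred from the provisional image
    render_radiance_hints: Callable,      # canonical-material renderings of that shape under the target lighting
    encode_provisional: Callable,         # neural encoding of the provisional image (texture cue)
    dilightnet: Callable,                 # stage 2: refined diffusion model conditioned on the hints
    resynthesize_background: Callable,    # stage 3: background consistent with the foreground lighting
):
    # Stage 1: generate a provisional image under uncontrolled lighting.
    provisional = generate_provisional(text_prompt)

    # Stage 2: radiance hints from a coarse shape estimate; per the paper's key observation,
    # the hints only need to point the diffusion model in the right direction, so the
    # geometry does not have to be exact.
    coarse_shape = estimate_coarse_shape(provisional)
    hints = render_radiance_hints(coarse_shape, target_lighting)

    # Texture preservation: multiply each radiance hint with a neural encoding of the
    # provisional image before passing the result to DiLightNet.
    encoding = encode_provisional(provisional)
    conditioned_hints = [hint * encoding for hint in hints]
    foreground = dilightnet(text_prompt, conditioned_hints)

    # Stage 3: resynthesize the background around the relit foreground object.
    return resynthesize_background(foreground, text_prompt, target_lighting)
```

The multiplication in stage 2 is the abstract's stated mechanism for retaining texture detail from the provisional image while the radiance hints carry the target shading; whether the hints are a list of separate renderings (as assumed here) or a single stacked conditioning tensor is an implementation detail the abstract does not specify.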
