DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation
February 19, 2024
Authors: Chong Zeng, Yue Dong, Pieter Peers, Youkang Kong, Hongzhi Wu, Xin Tong
cs.AI
Abstract
This paper presents a novel method for exerting fine-grained lighting control
during text-driven diffusion-based image generation. While existing diffusion
models already have the ability to generate images under any lighting
condition, without additional guidance these models tend to correlate image
content and lighting. Moreover, text prompts lack the necessary expressive
power to describe detailed lighting setups. To provide the content creator with
fine-grained control over the lighting during image generation, we augment the
text prompt with detailed lighting information in the form of radiance hints,
i.e., visualizations of the scene geometry with a homogeneous canonical
material under the target lighting. However, the scene geometry needed to
produce the radiance hints is unknown. Our key observation is that we only need
to guide the diffusion process, hence exact radiance hints are not necessary;
we only need to point the diffusion model in the right direction. Based on this
observation, we introduce a three-stage method for controlling the lighting
during image generation. In the first stage, we leverage a standard pretrained
diffusion model to generate a provisional image under uncontrolled lighting.
Next, in the second stage, we resynthesize and refine the foreground object in
the generated image by passing the target lighting to a refined diffusion
model, named DiLightNet, using radiance hints computed on a coarse shape of the
foreground object inferred from the provisional image. To retain the texture
details, we multiply the radiance hints with a neural encoding of the
provisionally synthesized image before passing it to DiLightNet. Finally, in the
third stage, we resynthesize the background to be consistent with the lighting
on the foreground object. We demonstrate and validate our lighting-controlled
diffusion model on a variety of text prompts and lighting conditions.
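The stage-two conditioning step described above can be made concrete with a short sketch. The PyTorch fragment below illustrates how radiance hints rendered under the target lighting might be multiplied with a neural encoding of the provisional image before being passed to DiLightNet as a control signal. The encoder architecture, channel counts, tensor shapes, and names such as ProvisionalEncoder and dilightnet_conditioning are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the stage-2 conditioning step (assumptions: encoder
# architecture, channel counts, and shapes are placeholders for illustration).
import torch
import torch.nn as nn


class ProvisionalEncoder(nn.Module):
    """Hypothetical neural encoder of the stage-1 provisional image; its
    output modulates the radiance hints so texture detail is retained."""

    def __init__(self, hint_channels: int = 12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, hint_channels, kernel_size=3, padding=1),
        )

    def forward(self, provisional_rgb: torch.Tensor) -> torch.Tensor:
        return self.net(provisional_rgb)


def dilightnet_conditioning(provisional_rgb: torch.Tensor,
                            radiance_hints: torch.Tensor,
                            encoder: ProvisionalEncoder) -> torch.Tensor:
    """Multiply the radiance hints (renderings of the coarse foreground shape
    under the target lighting) with the neural encoding of the provisional
    image; the product is the control signal fed to DiLightNet."""
    encoding = encoder(provisional_rgb)   # (B, hint_channels, H, W)
    return radiance_hints * encoding      # element-wise modulation


# Shape-only usage example with random placeholder tensors.
encoder = ProvisionalEncoder(hint_channels=12)
provisional = torch.rand(1, 3, 512, 512)   # stage-1 image, uncontrolled lighting
hints = torch.rand(1, 12, 512, 512)        # e.g. several canonical-material renderings x RGB
control = dilightnet_conditioning(provisional, hints, encoder)
```

As the abstract frames it, the multiplication lets the radiance hints act as shading that modulates appearance features derived from the provisional image, rather than replacing them, which is why this combination is presented as the mechanism for retaining texture details.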