LuxDiT: Lighting Estimation with Video Diffusion Transformer
September 3, 2025
Authors: Ruofan Liang, Kai He, Zan Gojcic, Igor Gilitschenski, Sanja Fidler, Nandita Vijaykumar, Zian Wang
cs.AI
Abstract
Estimating scene lighting from a single image or video remains a longstanding
challenge in computer vision and graphics. Learning-based approaches are
constrained by the scarcity of ground-truth HDR environment maps, which are
expensive to capture and limited in diversity. While recent generative models
offer strong priors for image synthesis, lighting estimation remains difficult
due to its reliance on indirect visual cues, the need to infer global
(non-local) context, and the recovery of high-dynamic-range outputs. We propose
LuxDiT, a novel data-driven approach that fine-tunes a video diffusion
transformer to generate HDR environment maps conditioned on visual input.
Trained on a large synthetic dataset with diverse lighting conditions, our
model learns to infer illumination from indirect visual cues and generalizes
effectively to real-world scenes. To improve semantic alignment between the
input and the predicted environment map, we introduce a low-rank adaptation
fine-tuning strategy using a collected dataset of HDR panoramas. Our method
produces accurate lighting predictions with realistic angular high-frequency
details, outperforming existing state-of-the-art techniques in both
quantitative and qualitative evaluations.
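The abstract names low-rank adaptation (LoRA) as the mechanism for adapting the pretrained video diffusion transformer to real HDR panoramas, but gives no implementation details. The sketch below is a minimal, generic illustration of how a LoRA adapter wraps a frozen linear projection in a transformer; the class name `LoRALinear`, the rank and scaling values, and the layer sizes are illustrative assumptions, not code from the paper.

```python
# Minimal, illustrative sketch of low-rank adaptation (LoRA) on a linear
# projection, as typically used to fine-tune transformer attention layers.
# Module names, rank, and dimensions are hypothetical, not from LuxDiT.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        # Low-rank factors: A maps features down to `rank`, B maps back up.
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen full-rank path plus the trainable low-rank correction.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale


# Usage: wrap a projection of a pretrained transformer block.
attn_qkv = nn.Linear(1024, 3 * 1024)   # stands in for a pretrained layer
adapted = LoRALinear(attn_qkv, rank=16)
tokens = torch.randn(2, 77, 1024)      # (batch, sequence, hidden)
print(adapted(tokens).shape)           # torch.Size([2, 77, 3072])
```

In this setup only the small `lora_a`/`lora_b` matrices are trained on the collected HDR panorama data, which keeps the adaptation cheap while preserving the priors of the pretrained diffusion transformer.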