
LuxDiT: Lighting Estimation with Video Diffusion Transformer

September 3, 2025
Authors: Ruofan Liang, Kai He, Zan Gojcic, Igor Gilitschenski, Sanja Fidler, Nandita Vijaykumar, Zian Wang
cs.AI

Abstract

Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.
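The abstract mentions a low-rank adaptation (LoRA) finetuning strategy for aligning the diffusion transformer's predictions with the input scene. The snippet below is a minimal, hypothetical sketch of the general LoRA idea, not the authors' implementation: a pretrained linear projection is frozen and a small trainable low-rank branch is added on top, so only the adapter parameters are updated during finetuning. The module and layer names here are illustrative assumptions.

```python
# Hypothetical LoRA sketch (not LuxDiT's code): freeze a pretrained linear layer
# and add a trainable low-rank update, so finetuning touches only a few parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)               # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus scaled low-rank correction.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Toy usage: wrap the feed-forward projections of a standard transformer block.
block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
block.linear1 = LoRALinear(block.linear1, rank=16)
block.linear2 = LoRALinear(block.linear2, rank=16)

tokens = torch.randn(2, 64, 512)                         # (batch, tokens, channels)
out = block(tokens)
print(out.shape)                                          # torch.Size([2, 64, 512])
```

In a setup like the one described, only adapter weights of this kind would be optimized on the collected HDR panorama dataset, keeping the pretrained video diffusion backbone intact.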