LuxDiT: Lighting Estimation with Video Diffusion Transformer

Abstract

Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth high-dynamic-range (HDR) environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult because it relies on indirect visual cues, requires inferring global (non-local) context, and must recover HDR outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation fine-tuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.
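The abstract names low-rank adaptation but does not describe how it is configured. As a generic illustration of the technique only, the sketch below adds a trainable low-rank update to a frozen weight matrix, W' = W + (alpha / r) * B A. The class name `LoRALinear`, the `rank` and `alpha` parameters, and the initialization constants are assumptions chosen for this example and are not taken from LuxDiT.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration of low-rank adaptation, not the paper's code.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, kept frozen
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update is
        # exactly zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A) as a list of rows."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to one input vector of length in_dim."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count the adapter parameters (entries of A and B only)."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)
```

With a 4 x 4 weight and rank 1, the adapter trains 8 values instead of 16, and the layer reproduces the pretrained output until `B` is updated. This is why the approach suits fine-tuning a large pretrained model on a comparatively small collected dataset.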
