Generative World Renderer

Abstract

Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: it enables robust in-the-wild geometry and material decomposition, and it facilitates high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol that measures semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, and that our VLM evaluation correlates strongly with human judgment. Combined with our toolkit, our forward renderer enables users to edit the style of AAA games from G-buffers using text prompts.
