

FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering

December 18, 2025
Authors: Ole Beisswenger, Jan-Niklas Dihlmann, Hendrik P. A. Lensch
cs.AI

Abstract

Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model's own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.
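To make the autoregressive, dual-conditioned inference pattern concrete, the following is a minimal PyTorch sketch of the loop described in the abstract: each frame is generated from its G-buffer (structural guidance via a ControlNet-style branch) and the previously generated frame (temporal guidance via a ControlLoRA-style low-rank branch). All class and function names here (GBufferControlNet, TemporalControlLoRA, DenoisingUNet, render_sequence) are illustrative stand-ins rather than the authors' implementation, and the inner refinement loop is a toy placeholder for a real diffusion sampler.

```python
# Hypothetical sketch of FrameDiffuser-style autoregressive inference.
# Module names are illustrative stand-ins, not the paper's actual classes.

import torch
import torch.nn as nn


class GBufferControlNet(nn.Module):
    """Stand-in for a ControlNet branch: encodes per-frame G-buffer maps
    (albedo, normals, depth, roughness, ...) into structural guidance features."""
    def __init__(self, gbuffer_channels: int, feat_dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(gbuffer_channels, feat_dim, kernel_size=3, padding=1)

    def forward(self, gbuffer: torch.Tensor) -> torch.Tensor:
        return self.encoder(gbuffer)


class TemporalControlLoRA(nn.Module):
    """Stand-in for a ControlLoRA branch: encodes the previously generated
    frame into low-rank temporal guidance features."""
    def __init__(self, feat_dim: int = 64, rank: int = 4):
        super().__init__()
        self.down = nn.Conv2d(3, rank, kernel_size=1)
        self.up = nn.Conv2d(rank, feat_dim, kernel_size=1)

    def forward(self, prev_frame: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(prev_frame))


class DenoisingUNet(nn.Module):
    """Toy denoiser: maps noise plus both conditioning feature maps to an RGB
    frame (a real model would be a latent diffusion UNet with injected features)."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 2 * feat_dim, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, x_t, struct_feat, temp_feat):
        return self.net(torch.cat([x_t, struct_feat, temp_feat], dim=1))


@torch.no_grad()
def render_sequence(gbuffers, first_frame, controlnet, control_lora, denoiser, steps=4):
    """Autoregressive loop: every frame after the first is generated from noise,
    conditioned on its G-buffer and on the frame generated immediately before it."""
    prev = first_frame
    frames = [first_frame]
    for gbuf in gbuffers[1:]:
        struct_feat = controlnet(gbuf)      # structural guidance from the G-buffer
        temp_feat = control_lora(prev)      # temporal guidance from the last output
        x = torch.randn_like(prev)          # start from Gaussian noise
        for _ in range(steps):              # crude refinement loop standing in
            x = denoiser(x, struct_feat, temp_feat)  # for a diffusion sampler
        prev = x                            # feed back as the next frame's condition
        frames.append(x)
    return frames


if __name__ == "__main__":
    # Tiny smoke test with random data: five 8-channel G-buffers at 64x64.
    gbufs = [torch.randn(1, 8, 64, 64) for _ in range(5)]
    first = torch.randn(1, 3, 64, 64)
    frames = render_sequence(gbufs, first, GBufferControlNet(8),
                             TemporalControlLoRA(), DenoisingUNet())
    print(len(frames), frames[-1].shape)
```

The point of the sketch is the data flow, not the networks: after the first frame, only the per-frame G-buffer and the model's own previous output enter the loop, which is what allows the method to run frame-by-frame on user-driven input rather than on a precomputed sequence.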