
FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering

December 18, 2025
Authors: Ole Beisswenger, Jan-Niklas Dihlmann, Hendrik P. A. Lensch
cs.AI

Abstract

Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model's own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.
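
To make the rollout concrete, below is a minimal sketch of the autoregressive conditioning loop described in the abstract: after an initial frame, each new frame is produced from the incoming G-buffer plus the model's own previous output. This is an illustration only, not the authors' released code; GBufferDenoiser, render_sequence, and the channel counts are hypothetical placeholders standing in for the actual ControlNet/ControlLoRA-conditioned diffusion backbone.

import torch
import torch.nn as nn

class GBufferDenoiser(nn.Module):
    # Hypothetical stand-in for the diffusion model: it takes the G-buffer
    # channels (geometry, materials, surface properties) concatenated with
    # the previously generated RGB frame and predicts the next RGB frame.
    def __init__(self, gbuffer_channels: int = 9):
        super().__init__()
        # gbuffer_channels + 3 channels for the previous frame in, 3 RGB out.
        self.net = nn.Conv2d(gbuffer_channels + 3, 3, kernel_size=3, padding=1)

    def forward(self, gbuffer: torch.Tensor, prev_frame: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(torch.cat([gbuffer, prev_frame], dim=1)))

@torch.no_grad()
def render_sequence(model: nn.Module, gbuffers, first_frame: torch.Tensor):
    # Autoregressive rollout: only the per-frame G-buffer and the model's
    # previous output are used as conditioning for every subsequent frame.
    frames = [first_frame]
    for gbuffer in gbuffers:
        frames.append(model(gbuffer, frames[-1]))
    return frames

if __name__ == "__main__":
    model = GBufferDenoiser()
    h, w = 64, 64
    first = torch.rand(1, 3, h, w)                       # initial rendered frame
    gbufs = [torch.rand(1, 9, h, w) for _ in range(5)]   # per-frame G-buffers
    out = render_sequence(model, gbufs, first)
    print(len(out), out[-1].shape)                       # 6 frames of shape (1, 3, 64, 64)

In the paper's setting, the placeholder single-step network above would be replaced by the iterative diffusion sampler with its ControlNet (structural) and ControlLoRA (temporal) branches, but the outer loop over frames has the same shape.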