RealMaster: Lifting Rendered Scenes into Photorealistic Video

Abstract

State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.
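The anchor-based propagation step described above can be sketched roughly as follows. This is only an illustrative outline under stated assumptions, not the authors' implementation: the names `VideoPair`, `build_pair`, `enhance_frame`, and `propagate` are hypothetical, frames are modeled as plain lists of floats, and the toy stand-ins at the bottom replace the real enhancement and geometry-conditioned propagation models, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]  # stand-in for an image; a real pipeline would use tensors


@dataclass
class VideoPair:
    rendered: List[Frame]   # 3D-engine output (model input)
    realistic: List[Frame]  # pipeline output (training target)


def build_pair(
    rendered: Sequence[Frame],
    geometry: Sequence[Frame],
    enhance_frame: Callable[[Frame], Frame],
    propagate: Callable[[Frame, Frame, Sequence[Frame]], List[Frame]],
) -> VideoPair:
    """Enhance the first and last frames, then fill in the frames between them.

    `geometry` holds one conditioning cue per frame (hypothetical format).
    """
    if len(rendered) < 2 or len(rendered) != len(geometry):
        raise ValueError("need at least two frames and one geometry cue per frame")
    first = enhance_frame(rendered[0])   # anchor 1
    last = enhance_frame(rendered[-1])   # anchor 2
    middle = propagate(first, last, geometry[1:-1])
    if len(middle) != len(rendered) - 2:
        raise ValueError("propagate must return one frame per intermediate cue")
    return VideoPair(list(rendered), [first, *middle, last])


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_enhance(frame: Frame) -> Frame:
    return [value + 10.0 for value in frame]


def toy_propagate(first: Frame, last: Frame, cues: Sequence[Frame]) -> List[Frame]:
    # Linear blend between the anchors, one output per cue; a real model
    # would condition on the cue contents rather than only their count.
    n = len(cues)
    return [
        [a + (b - a) * (i + 1) / (n + 1) for a, b in zip(first, last)]
        for i in range(n)
    ]
```

In this sketch, each resulting `VideoPair` would serve as one training example for the IC-LoRA stage, with `rendered` as the conditioning video and `realistic` as the target. Because the trained model only sees the rendered input at inference time, it would not need anchor frames there, which matches the generalization described above.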