RealMaster: Lifting Rendered Scenes into Photorealistic Video
March 24, 2026
Authors: Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar
cs.AI
Abstract
State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.
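The anchor-based data generation described in the abstract can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: `build_paired_clip`, `enhance`, `propagate`, `toy_enhance`, and `toy_propagate` are hypothetical names, frames are toy lists of floats rather than images, and the toy propagation is a plain linear blend of the two anchors that ignores the geometric conditioning cues the paper relies on.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for an image: a flat list of pixel values


@dataclass
class PairedClip:
    rendered: List[Frame]   # 3D engine output (would be the model input)
    realistic: List[Frame]  # pipeline output (would be the training target)


def build_paired_clip(
    rendered: Sequence[Frame],
    enhance: Callable[[Frame], Frame],
    propagate: Callable[[Frame, Frame, Frame, float], Frame],
) -> PairedClip:
    """Enhance the first and last frames, then fill in the frames between them.

    `propagate` receives the rendered frame (the place where geometric cues
    would come from), both enhanced anchors, and a position t in (0, 1).
    """
    if len(rendered) < 2:
        raise ValueError("need at least two frames to have two anchors")
    first_anchor = enhance(rendered[0])
    last_anchor = enhance(rendered[-1])
    last_index = len(rendered) - 1
    realistic = [first_anchor]
    for i in range(1, last_index):
        t = i / last_index  # normalized position between the two anchors
        realistic.append(propagate(rendered[i], first_anchor, last_anchor, t))
    realistic.append(last_anchor)
    return PairedClip(list(rendered), realistic)


def toy_enhance(frame: Frame) -> Frame:
    # Hypothetical placeholder for a per-frame realism enhancer.
    return [p * 0.5 for p in frame]


def toy_propagate(frame: Frame, first: Frame, last: Frame, t: float) -> Frame:
    # Hypothetical placeholder: linear blend of the anchors. A real pipeline
    # would condition on geometry derived from `frame`; this toy ignores it.
    return [(1.0 - t) * a + t * b for a, b in zip(first, last)]


clip = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
pair = build_paired_clip(clip, toy_enhance, toy_propagate)
```

Each resulting `PairedClip` holds one rendered sequence and its frame-aligned counterpart, which is the shape of supervision the abstract describes for training the IC-LoRA. The trained model is said to then run without anchor frames, so the anchors appear only in this data-construction step.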