RealMaster: Lifting Rendered Scenes into Photorealistic Video

Abstract

State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.
