WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

Abstract

Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, this potential is hindered by their limited controllability and geometric inconsistency, creating a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism during inference, which repeatedly optimizes network predictions within each denoising step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate our method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.
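
The gating idea behind Flow-Gated Latent Fusion can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `flow_gated_fusion`, the per-channel `scores` input, the hard `threshold` gate, and the channel-wise replacement rule are all assumptions made for illustration. The abstract only states that optical flow similarity is used to select motion-related channels for trajectory guidance.

```python
from typing import List, Sequence, TypeVar

# A channel can be any per-channel latent object (e.g. an array slice).
Channel = TypeVar("Channel")


def flow_gated_fusion(
    latent: Sequence[Channel],
    guide: Sequence[Channel],
    scores: Sequence[float],
    threshold: float = 0.5,
) -> List[Channel]:
    """Inject guidance only into channels that pass a motion gate.

    ASSUMPTION: `scores[c]` is a precomputed similarity in [0, 1] between
    the optical flow of latent channel c and the target trajectory flow.
    How such a score is computed is not specified in the abstract.
    ASSUMPTION: gated channels are replaced outright by the guidance
    channel; a soft blend would be an equally plausible reading.
    """
    if not (len(latent) == len(guide) == len(scores)):
        raise ValueError("latent, guide, and scores need one entry per channel")
    # Channels judged motion-related take the guidance; the rest keep the
    # model's own latent, so appearance is left untouched.
    return [g if s >= threshold else z for z, g, s in zip(latent, guide, scores)]


# Toy usage: three channels, labelled by strings for readability.
fused = flow_gated_fusion(
    latent=["z0", "z1", "z2"],
    guide=["g0", "g1", "g2"],
    scores=[0.9, 0.1, 0.5],
)
```

In this toy call, channels 0 and 2 meet the threshold and take the guidance values, while channel 1 keeps its original latent. A hard gate is used here only because it is the simplest rule to read; it should not be taken as the paper's actual fusion rule.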
