WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

September 18, 2025
Authors: Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang
cs.AI

Abstract

Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, this potential is hindered by their limited controllability and geometric inconsistency, creating a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism during inference, which repeatedly optimizes network predictions within each denoising step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate our method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.
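The three modules compose into a single inference-time loop around a frozen denoiser. Below is a minimal PyTorch sketch of how such training-free guidance could be wired together, assuming a generic latent video denoiser. `ToyDenoiser`, `guided_denoise`, the channel-mask construction, and all hyperparameters are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch

# Toy stand-in for a pretrained latent video denoiser. Shapes and names
# are illustrative assumptions; the paper's backbone is not specified here.
class ToyDenoiser(torch.nn.Module):
    def __init__(self, channels: int = 8):
        super().__init__()
        self.net = torch.nn.Conv3d(channels, channels, 3, padding=1)

    def forward(self, z: torch.Tensor, t: int) -> torch.Tensor:
        # Predicts a denoising direction for latent z; t is unused in this toy.
        return self.net(z)

def flow_gated_fusion(pred, traj_pred, motion_mask):
    # Flow-Gated Latent Fusion (sketch): inject trajectory guidance only into
    # channels flagged as motion-related, leaving appearance channels intact.
    return motion_mask * traj_pred + (1.0 - motion_mask) * pred

@torch.no_grad()
def guided_denoise(denoiser, z, traj_latent, steps=10, inner_iters=3, corr_weight=0.5):
    """Conceptual training-free, inference-time guidance loop (assumed form)."""
    for t in range(steps, 0, -1):
        # Intra-Step Recursive Refinement: repeatedly refine the prediction
        # within one denoising step, nudging the latent toward the trajectory.
        pred = denoiser(z, t)
        for _ in range(inner_iters):
            z_guided = z + 0.1 * (traj_latent - z)
            pred = denoiser(z_guided, t)

        # Flow-Gated Latent Fusion: in the paper the mask comes from optical-
        # flow similarity; here a fixed toy mask marks "motion" channels.
        motion_mask = torch.zeros_like(pred)
        motion_mask[:, : pred.shape[1] // 2] = 1.0
        traj_pred = denoiser(traj_latent, t)
        fused = flow_gated_fusion(pred, traj_pred, motion_mask)

        # Dual-Path Self-Corrective Guidance: compare guided vs. unguided
        # paths and push away from drift caused by noisy structural signals.
        unguided = denoiser(z, t)
        corrected = fused + corr_weight * (fused - unguided)

        z = z - (1.0 / steps) * corrected  # one Euler-style update
    return z

if __name__ == "__main__":
    z0 = torch.randn(1, 8, 4, 16, 16)   # (batch, channels, frames, H, W)
    traj = torch.randn_like(z0)          # latent encoding a target camera trajectory
    out = guided_denoise(ToyDenoiser(), z0, traj)
    print(out.shape)
```

Because every operation above happens at sampling time on a frozen network, the sketch mirrors the paper's central claim: trajectory control is obtained purely by steering the denoising process, with no retraining or fine-tuning of the pretrained prior.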