WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance
September 18, 2025
Authors: Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang
cs.AI
Abstract
Recent video diffusion models demonstrate strong potential in spatial
intelligence tasks due to their rich latent world priors. However, this
potential is hindered by their limited controllability and geometric
inconsistency, creating a gap between their strong priors and their practical
use in 3D/4D tasks. As a result, current approaches often rely on retraining or
fine-tuning, which risks degrading pretrained knowledge and incurs high
computational costs. To address this, we propose WorldForge, a training-free,
inference-time framework composed of three tightly coupled modules. Intra-Step
Recursive Refinement introduces a recursive refinement mechanism during
inference, which repeatedly optimizes network predictions within each denoising
step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages
optical flow similarity to decouple motion from appearance in the latent space
and selectively inject trajectory guidance into motion-related channels.
Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths
to adaptively correct trajectory drift caused by noisy or misaligned structural
signals. Together, these components inject fine-grained, trajectory-aligned
guidance without training, achieving both accurate motion control and
photorealistic content generation. Extensive experiments across diverse
benchmarks validate our method's superiority in realism, trajectory
consistency, and visual fidelity. This work introduces a novel plug-and-play
paradigm for controllable video synthesis, offering a new perspective on
leveraging generative priors for spatial intelligence.
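Because the framework is training-free and operates entirely at inference time, its core logic amounts to a modified denoising loop. The sketch below (PyTorch) shows one plausible reading of the three modules described in the abstract; the toy noise predictor, the DDIM-style update, and all helper names (`estimate_x0`, `renoise`, `channel_gate`, and the per-channel cosine-similarity gate used in place of the paper's optical-flow gating) are illustrative assumptions, not the authors' implementation.

```python
import torch

def toy_model(x, t):
    """Stand-in noise predictor; a real video diffusion backbone would go here."""
    return 0.1 * torch.tanh(x) + 0.01 * t

def estimate_x0(x_t, eps, a_t):
    """Standard clean-latent estimate from a noise prediction at noise level a_t."""
    return (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()

def renoise(x0, a_t):
    """Re-derive a noisy latent at level a_t from a (refined) clean estimate."""
    return a_t.sqrt() * x0 + (1 - a_t).sqrt() * torch.randn_like(x0)

def channel_gate(x0_hat, x0_ref, thresh=0.5):
    """Per-channel cosine similarity as a crude stand-in for optical-flow gating."""
    sim = torch.nn.functional.cosine_similarity(
        x0_hat.flatten(2), x0_ref.flatten(2), dim=-1)       # (B, C)
    return (sim > thresh).float()[..., None, None]           # broadcastable gate

def guided_denoising(x_T, traj_x0, alphas, K=2, corr_w=0.3, thresh=0.5):
    x_g, x_f = x_T.clone(), x_T.clone()        # guided and unguided (free) paths
    for t in reversed(range(len(alphas))):
        a_t = alphas[t]
        eps_f = toy_model(x_f, t)              # unguided prediction (reference path)
        x_work = x_g
        # Intra-step recursive refinement: repeatedly refine within this step
        for _ in range(K):
            eps = toy_model(x_work, t)
            x0_hat = estimate_x0(x_work, eps, a_t)
            # Flow-gated latent fusion: inject guidance only into gated channels
            g = channel_gate(x0_hat, traj_x0, thresh)
            x0_hat = g * traj_x0 + (1 - g) * x0_hat
            x_work = renoise(x0_hat, a_t)
        # Dual-path self-corrective guidance: blend toward the unguided prediction
        eps_g = toy_model(x_work, t)
        eps_c = eps_g + corr_w * (eps_f - eps_g)
        a_prev = alphas[t - 1] if t > 0 else torch.tensor(1.0)
        x_g = a_prev.sqrt() * estimate_x0(x_work, eps_c, a_t) + (1 - a_prev).sqrt() * eps_c
        x_f = a_prev.sqrt() * estimate_x0(x_f, eps_f, a_t) + (1 - a_prev).sqrt() * eps_f
    return x_g

# Toy usage: a single 4-channel latent and a trajectory-aligned target latent
x_T = torch.randn(1, 4, 16, 16)
traj_x0 = torch.randn(1, 4, 16, 16)            # e.g. a warped/rendered reference latent
alphas = torch.linspace(0.99, 0.05, 20)        # toy noise schedule (alpha_bar)
print(guided_denoising(x_T, traj_x0, alphas).shape)
```

In this reading, the unguided path acts as an anchor: when the injected structural signal is noisy or misaligned, blending the guided prediction toward the unguided one pulls the trajectory back toward the model's own prior, which matches the drift-correction role the abstract assigns to dual-path guidance.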