DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders
December 15, 2025
Authors: Susung Hong, Chongjian Ge, Zhifei Zhang, Jui-Hsien Wang
cs.AI
Abstract
Video diffusion models have revolutionized generative video synthesis, but they remain imprecise, slow, and opaque during generation, keeping users in the dark for prolonged periods. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model generates multi-modal preview representations, including RGB and scene intrinsics, at more than 4× real-time speed (less than 1 second for a 4-second video); these previews convey appearance and motion consistent with the final video. With the trained decoder, we show that it is possible to interactively guide generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
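The abstract names two mechanisms: a lightweight decoder that turns an intermediate denoising latent into a coarse preview, and stochasticity reinjection, i.e., re-noising the current estimate so sampling can re-enter the trajectory with fresh randomness. Below is a minimal sketch of both ideas in generic diffusion terms; it is not the paper's implementation, and all names (PreviewDecoder, latent_dim, alphas_cumprod) are hypothetical assumptions for illustration.

```python
import torch
import torch.nn as nn


class PreviewDecoder(nn.Module):
    """Hypothetical lightweight head: maps a noisy video latent of shape
    (B, C, T, H, W) to a coarse multi-modal preview, here 3 RGB channels
    plus 3 extra channels standing in for a scene intrinsic (e.g. normals)."""

    def __init__(self, latent_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(latent_dim, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, 3 + 3, kernel_size=3, padding=1),
        )

    def forward(self, latent: torch.Tensor):
        rgb, intrinsic = self.net(latent).split(3, dim=1)
        return rgb.tanh(), intrinsic  # preview RGB in [-1, 1] plus intrinsic map


@torch.no_grad()
def preview_at_step(latent_t: torch.Tensor, decoder: PreviewDecoder):
    """Decode a preview from the latent at any intermediate timestep,
    without running the remaining denoising steps."""
    return decoder(latent_t)


def reinject_stochasticity(x0_pred: torch.Tensor, t: int,
                           alphas_cumprod: torch.Tensor,
                           strength: float = 1.0) -> torch.Tensor:
    """Generic renoising step (standard DDPM forward process): blend the
    current clean estimate with fresh Gaussian noise so sampling can resume
    at step t along a new stochastic branch. The paper's exact guidance
    mechanism may differ."""
    a_t = alphas_cumprod[t]
    noise = torch.randn_like(x0_pred) * strength
    return a_t.sqrt() * x0_pred + (1.0 - a_t).sqrt() * noise
```

In a sampling loop, one would call preview_at_step on the current latent whenever the user requests a preview, and reinject_stochasticity to branch the generation from that point with new randomness.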