

Improving Progressive Generation with Decomposable Flow Matching

June 24, 2025
作者: Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Arpit Sahni, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
cs.AI

Abstract

Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. These architectures increase the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad-hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As shown by our experiments, our approach improves visual quality for both images and videos, yielding superior results compared to prior multi-stage frameworks. On Imagenet-1k 512px, DFM achieves a 35.2% improvement in FDD score over the base architecture and 26.4% over the best-performing baseline, under the same training compute. When applied to the finetuning of large models, such as FLUX, DFM shows faster convergence to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.
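To make the core idea concrete, below is a minimal PyTorch sketch of the technique the abstract describes: decompose the target into a Laplacian pyramid and apply an independent Flow Matching objective at each level. The `velocity_model` interface, the per-level conditioning, and all names here are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch: per-level Flow Matching over a Laplacian pyramid.
# `velocity_model(x_t, t, level)` is a hypothetical network interface
# assumed for illustration; the paper's real API may differ.
import torch
import torch.nn.functional as F

def laplacian_pyramid(x: torch.Tensor, num_levels: int) -> list[torch.Tensor]:
    """Decompose a batch of images (B, C, H, W) into Laplacian bands."""
    bands = []
    current = x
    for _ in range(num_levels - 1):
        down = F.interpolate(current, scale_factor=0.5, mode="bilinear")
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear")
        bands.append(current - up)   # high-frequency residual at this scale
        current = down
    bands.append(current)            # coarsest low-frequency level
    return bands

def dfm_loss(velocity_model, x0: torch.Tensor, num_levels: int = 3) -> torch.Tensor:
    """Flow Matching applied independently at each pyramid level.

    Each level gets its own noise sample and its own timestep, so coarse
    levels can be denoised ahead of fine ones at sampling time.
    """
    losses = []
    for level, band in enumerate(laplacian_pyramid(x0, num_levels)):
        noise = torch.randn_like(band)
        t = torch.rand(band.shape[0], device=band.device).view(-1, 1, 1, 1)
        x_t = (1 - t) * noise + t * band   # linear path from noise to data
        target = band - noise              # constant velocity along that path
        pred = velocity_model(x_t, t.flatten(), level)
        losses.append(F.mse_loss(pred, target))
    return torch.stack(losses).mean()
```

Because each level carries its own independent noise and timestep, this matches the abstract's claim of a single model with minimal changes to a standard Flow Matching training loop: only the decomposition step and the per-level loss are added.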