

Improving Progressive Generation with Decomposable Flow Matching

June 24, 2025
Authors: Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Arpit Sahni, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
cs.AI

Abstract

Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. These architectures increase the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad-hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As our experiments show, the approach improves visual quality for both images and videos, achieving superior results compared to prior multi-stage frameworks. On ImageNet-1k 512px, DFM achieves a 35.2% improvement in FDD scores over the base architecture and 26.4% over the best-performing baseline, under the same training compute. When applied to fine-tuning of large models, such as FLUX, DFM shows faster convergence to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.