Discrete Flow Matching

Abstract

Although Flow Matching and diffusion models have emerged as powerful generative paradigms for continuous variables such as images and videos, their application to high-dimensional discrete data, such as language, remains limited. In this work, we present Discrete Flow Matching, a novel discrete flow paradigm designed specifically for generating discrete data. Discrete Flow Matching makes several key contributions: (i) it works with a general family of probability paths interpolating between source and target distributions; (ii) it provides a generic formula for sampling from these probability paths using learned posteriors such as the probability denoiser (x-prediction) and noise-prediction (epsilon-prediction); (iii) in practice, focusing on specific probability paths defined with different schedulers considerably improves generative perplexity compared to previous discrete diffusion and flow models; and (iv) by scaling Discrete Flow Matching models up to 1.7B parameters, we reach 6.7% Pass@1 and 13.4% Pass@10 on HumanEval and 6.7% Pass@1 and 20.6% Pass@10 on 1-shot MBPP coding benchmarks. Our approach can generate high-quality discrete data in a non-autoregressive fashion, significantly closing the gap between autoregressive models and discrete flow models.
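To make the idea of a scheduler-defined probability path between a source and a target distribution more concrete, here is a minimal sketch of one simple possibility: a token-wise mixture path in which each position independently takes its target token with probability kappa(t) and keeps its source token otherwise. The abstract does not specify this form; the polynomial scheduler `kappa`, the all-mask source, the `MASK` id, and the function names are assumptions chosen for illustration only.

```python
import numpy as np


def kappa(t, power=2.0):
    """Hypothetical polynomial scheduler with kappa(0) = 0 and kappa(1) = 1."""
    return t ** power


def sample_path(x0, x1, t, rng):
    """Draw x_t from an assumed token-wise mixture path.

    Each position independently takes the target token x1[i] with
    probability kappa(t) and keeps the source token x0[i] otherwise.
    """
    x0 = np.asarray(x0)
    x1 = np.asarray(x1)
    take_target = rng.random(x0.shape) < kappa(t)
    return np.where(take_target, x1, x0)


MASK = 0  # hypothetical id of a mask token used as the source
rng = np.random.default_rng(0)
x1 = np.array([5, 9, 3, 7, 2, 8])  # toy target sequence of token ids
x0 = np.full_like(x1, MASK)        # toy source sequence: all mask tokens

for t in (0.0, 0.5, 1.0):
    print(t, sample_path(x0, x1, t, rng))
```

At t = 0 the sample equals the source sequence and at t = 1 it equals the target sequence; intermediate times give a partially unmasked sequence. Changing `power` changes how quickly tokens are revealed, which is the kind of scheduler choice contribution (iii) refers to.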