Large Language Diffusion Models
February 14, 2025
Authors: Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
cs.AI
Abstract
Autoregressive models (ARMs) are widely regarded as the cornerstone of large
language models (LLMs). We challenge this notion by introducing LLaDA, a
diffusion model trained from scratch under the pre-training and supervised
fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data
masking process and a reverse process, parameterized by a vanilla Transformer
to predict masked tokens. By optimizing a likelihood bound, it provides a
principled generative approach for probabilistic inference. Across extensive
benchmarks, LLaDA demonstrates strong scalability, outperforming our
self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong
LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive
instruction-following abilities in case studies such as multi-turn dialogue.
Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal
poem completion task. Our findings establish diffusion models as a viable and
promising alternative to ARMs, challenging the assumption that key LLM
capabilities discussed above are inherently tied to ARMs.
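
To make the forward masking process and the likelihood bound mentioned in the abstract concrete, below is a minimal PyTorch-style sketch of a masked-diffusion training loss: each token is masked independently with probability t drawn uniformly per sequence, and a Transformer is trained to predict the original tokens at masked positions. The function name `masked_diffusion_loss`, the `model` interface, and `mask_id` are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): forward masking plus a Monte Carlo
# estimate of the masked-token likelihood bound used to train masked diffusion LMs.
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_id):
    """x0: (batch, seq_len) token ids; model(xt) returns (batch, seq_len, vocab) logits.
    Assumed interface for illustration only."""
    b, n = x0.shape
    # Sample a masking ratio t ~ U(0, 1] per sequence (clamped away from 0 for stability).
    t = torch.rand(b, 1, device=x0.device).clamp_min(1e-3)
    # Forward process: mask each token independently with probability t.
    is_masked = torch.rand(b, n, device=x0.device) < t
    xt = torch.where(is_masked, torch.full_like(x0, mask_id), x0)

    # Reverse model: predict the original tokens from the partially masked sequence.
    logits = model(xt)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (b, n)

    # Likelihood bound estimate: cross-entropy on masked positions, weighted by 1/t.
    loss = (is_masked * ce / t).sum() / (b * n)
    return loss
```

In this formulation, the 1/t weighting makes the per-step masked cross-entropy an unbiased estimate of the negative likelihood bound, which is what lets the model be trained as a principled generative model rather than with a heuristic denoising objective.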