Large Language Diffusion Models
February 14, 2025
Authors: Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
cs.AI
Abstract
Autoregressive models (ARMs) are widely regarded as the cornerstone of large
language models (LLMs). We challenge this notion by introducing LLaDA, a
diffusion model trained from scratch under the pre-training and supervised
fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data
masking process and a reverse process, parameterized by a vanilla Transformer
to predict masked tokens. By optimizing a likelihood bound, it provides a
principled generative approach for probabilistic inference. Across extensive
benchmarks, LLaDA demonstrates strong scalability, outperforming our
self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong
LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive
instruction-following abilities in case studies such as multi-turn dialogue.
Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal
poem completion task. Our findings establish diffusion models as a viable and
promising alternative to ARMs, challenging the assumption that key LLM
capabilities discussed above are inherently tied to ARMs.
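
To make the forward masking process and the likelihood bound mentioned in the abstract concrete, below is a minimal PyTorch-style sketch of a masked-diffusion training loss: each token is masked independently with probability t drawn uniformly per sequence, and a Transformer is trained to predict the original tokens at masked positions. The function name `masked_diffusion_loss`, the `model` interface, and `mask_id` are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): forward masking plus a Monte Carlo
# estimate of the masked-token likelihood bound used to train masked diffusion LMs.
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_id):
    """x0: (batch, seq_len) token ids; model(xt) returns (batch, seq_len, vocab) logits.
    Assumed interface for illustration only."""
    b, n = x0.shape
    # Sample a masking ratio t ~ U(0, 1] per sequence (clamped away from 0 for stability).
    t = torch.rand(b, 1, device=x0.device).clamp_min(1e-3)
    # Forward process: mask each token independently with probability t.
    is_masked = torch.rand(b, n, device=x0.device) < t
    xt = torch.where(is_masked, torch.full_like(x0, mask_id), x0)

    # Reverse model: predict the original tokens from the partially masked sequence.
    logits = model(xt)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (b, n)

    # Likelihood bound estimate: cross-entropy on masked positions, weighted by 1/t.
    loss = (is_masked * ce / t).sum() / (b * n)
    return loss
```

In this formulation, the 1/t weighting makes the per-step masked cross-entropy an unbiased estimate of the negative likelihood bound, which is what lets the model be trained as a principled generative model rather than with a heuristic denoising objective.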