

MemDLM: Memory-Enhanced DLM Training

March 23, 2026
Authors: Zehua Pei, Hui-Ling Zhen, Weizhe Lin, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu
cs.AI

Abstract

Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: https://github.com/JarvisPei/MemDLM.
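The abstract describes a bi-level scheme: an inner loop adapts per-sample fast weights (the Parametric Memory) on a simulated denoising trajectory, and an outer loop updates the base model conditioned on those fast weights. The following is a minimal toy sketch of that two-loop structure, not the authors' implementation: the linear model, the squared-error denoising loss, the step counts, and the learning rates are all illustrative assumptions.

```python
import numpy as np

def inner_loop(base_w, x_noisy, x_clean, steps=3, lr=0.1):
    """Hypothetical inner loop: for one sample, take a few gradient
    steps on a simulated denoising objective, updating only the fast
    weights. The adapted fast weights play the role of the per-sample
    Parametric Memory (a sketch under toy assumptions)."""
    fast_w = np.zeros_like(base_w)
    for _ in range(steps):
        pred = x_noisy @ (base_w + fast_w)          # base + fast weights
        grad = x_noisy.T @ (pred - x_clean) / len(x_noisy)
        fast_w -= lr * grad                          # only fast weights move
    return fast_w

def outer_step(base_w, batch, lr=0.05):
    """Hypothetical outer loop: update the base model using gradients
    computed with each sample's adapted fast weights in place."""
    grad = np.zeros_like(base_w)
    for x_noisy, x_clean in batch:
        fast_w = inner_loop(base_w, x_noisy, x_clean)
        pred = x_noisy @ (base_w + fast_w)           # conditioned on memory
        grad += x_noisy.T @ (pred - x_clean) / len(x_noisy)
    return base_w - lr * grad / len(batch)
```

The same `inner_loop` could, as the abstract suggests, be re-run at inference time to adapt the fast weights to a long context before decoding; the outer update would simply be skipped then.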