MemDLM: Memory-Enhanced DLM Training
March 23, 2026
Authors: Zehua Pei, Hui-Ling Zhen, Weizhe Lin, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu
cs.AI
Abstract
Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: https://github.com/JarvisPei/MemDLM.
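The bi-level structure described above — an inner loop that adapts per-sample fast weights (the Parametric Memory) over a short simulated denoising trajectory, and an outer loop that updates the base model conditioned on that memory — can be illustrated with a deliberately tiny scalar model. This is a sketch of the optimization pattern only; the toy loss, learning rates, and function names are illustrative assumptions, not the paper's implementation.

```python
import random

def loss(w_slow, w_fast, x, target):
    # Toy "denoising" loss: the prediction is linear in the sum of
    # slow (base-model) and fast (memory) weights.
    pred = (w_slow + w_fast) * x
    return (pred - target) ** 2

def grad(w_slow, w_fast, x, target):
    # Analytic gradient of the toy loss. Because the prediction depends
    # on w_slow + w_fast symmetrically, the same expression serves as
    # the gradient w.r.t. either weight.
    return 2 * ((w_slow + w_fast) * x - target) * x

def train(samples, outer_steps=300, inner_steps=4, lr_fast=0.05, lr_slow=0.01):
    w_slow = 0.0  # base-model parameter, updated by the outer loop
    for _ in range(outer_steps):
        x, target = random.choice(samples)
        # Inner loop: fast weights start fresh per sample and absorb
        # local trajectory experience over a few simulated steps.
        w_fast = 0.0
        for _ in range(inner_steps):
            w_fast -= lr_fast * grad(w_slow, w_fast, x, target)
        # Outer loop: update the base model conditioned on the adapted
        # memory (i.e., through the loss evaluated at the final w_fast).
        w_slow -= lr_slow * grad(w_slow, w_fast, x, target)
    return w_slow
```

At inference time, the same inner loop could be re-run on a given context to produce fresh fast weights — the adaptation step the abstract describes for long-context understanding.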