MemDLM: Memory-Enhanced DLM Training

Abstract

Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: https://github.com/JarvisPei/MemDLM.
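
The inner/outer loop structure described in the abstract can be illustrated with a toy sketch. This is not the MemDLM implementation: it assumes a linear model as a stand-in for the DLM, feature zeroing as a stand-in for token masking, additive fast weights, and a first-order outer update that treats the fast weights as constants. All function names, learning rates, and mask schedules here are hypothetical.

```python
import random

def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def grad_sq_loss(weights, x, y):
    """Gradient of 0.5 * (predict(weights, x) - y)**2 with respect to weights."""
    err = predict(weights, x) - y
    return [err * xi for xi in x]

def mask(x, ratio, rng):
    """Zero each feature with probability `ratio` (toy stand-in for token masking)."""
    return [0.0 if rng.random() < ratio else xi for xi in x]

def inner_loop(slow, x, y, mask_ratios, lr_fast, rng):
    """Adapt per-sample fast weights along a simulated denoising trajectory.

    The slow (base) weights are read but never modified here.
    """
    fast = [0.0] * len(slow)  # assumption: memory is reset for every sample
    for ratio in mask_ratios:  # e.g. from heavily masked to lightly masked
        x_t = mask(x, ratio, rng)
        combined = [s + f for s, f in zip(slow, fast)]
        g = grad_sq_loss(combined, x_t, y)
        fast = [f - lr_fast * gi for f, gi in zip(fast, g)]
    return fast

def outer_step(slow, fast, x, y, mask_ratio, lr_slow, rng):
    """Update the slow weights conditioned on the fast weights.

    Assumption: first-order update, no gradient flows through the inner loop.
    """
    x_t = mask(x, mask_ratio, rng)
    combined = [s + f for s, f in zip(slow, fast)]
    g = grad_sq_loss(combined, x_t, y)
    return [s - lr_slow * gi for s, gi in zip(slow, g)]

def train(data, dim, epochs=10, seed=0):
    """Alternate inner adaptation and outer updates over (x, y) pairs."""
    rng = random.Random(seed)
    slow = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            fast = inner_loop(slow, x, y, (0.75, 0.5, 0.25), 0.05, rng)
            slow = outer_step(slow, fast, x, y, 0.5, 0.05, rng)
    return slow
```

Under the same assumptions, re-enabling the inner loop at inference time would correspond to calling `inner_loop` on the input at hand and predicting with the sum of slow and fast weights, without calling `outer_step`.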
