MemDLM: Memory-Enhanced DLM Training

Abstract

Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: https://github.com/JarvisPei/MemDLM.
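
The inner-loop/outer-loop structure described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation. It assumes a linear denoiser in NumPy, hand-derived gradients, binary masks with a decreasing mask ratio standing in for the denoising trajectory, and a first-order outer update that treats the fast weights as constants. All function names (`masked_loss_and_grad`, `trajectory_masks`, `inner_loop`, `outer_step`) are hypothetical.

```python
import numpy as np


def masked_loss_and_grad(W, x, mask):
    """Squared error on masked positions when predicting x from its masked copy.

    mask is binary (1 = masked). Returns the loss and its gradient w.r.t. W.
    """
    x_in = x * (1.0 - mask)        # masked positions are zeroed in the input
    pred = W @ x_in
    err = (pred - x) * mask        # only masked positions contribute
    loss = 0.5 * np.sum(err ** 2)
    grad = np.outer(err, x_in)     # valid because mask entries are 0 or 1
    return loss, grad


def trajectory_masks(rng, dim, ratios=(0.75, 0.5, 0.25)):
    """Binary masks with decreasing mask ratio (assumed stand-in for denoising)."""
    return [(rng.random(dim) < r).astype(float) for r in ratios]


def inner_loop(W, x, masks, inner_lr=0.01):
    """Build per-sample fast weights by taking one gradient step per mask."""
    fast = np.zeros_like(W)
    for mask in masks:
        _, grad = masked_loss_and_grad(W + fast, x, mask)
        fast -= inner_lr * grad
    return fast


def outer_step(W, x, masks, target_mask, inner_lr=0.01, outer_lr=0.01):
    """Update the base weights using the loss evaluated with fast weights applied.

    First-order approximation: the fast weights are treated as constants,
    so no gradient flows back through the inner loop.
    """
    fast = inner_loop(W, x, masks, inner_lr)
    loss, grad = masked_loss_and_grad(W + fast, x, target_mask)
    return W - outer_lr * grad, loss


# Usage on a single random sample.
rng = np.random.default_rng(0)
dim = 8
W = np.zeros((dim, dim))
x = rng.normal(size=dim)
masks = trajectory_masks(rng, dim)
W_new, loss = outer_step(W, x, masks, target_mask=masks[0])
```

In this sketch, calling `inner_loop` on its own for a new sample loosely corresponds to re-enabling the inner loop at inference time as an adaptation step; the base weights stay fixed and only the per-sample fast weights change.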
