Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement

Abstract

Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.
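
The layer-selection step described above can be illustrated with a minimal sketch. The abstract does not specify which geometry-based proxies DiHAL uses, so everything here is an assumption made for illustration: the proxy (the participation ratio, a standard measure of effective dimensionality), the rule that a lower score marks a more diffusion-friendly layer, and the function names `participation_ratio` and `select_interface_layer`. The hidden states are synthetic rather than taken from a real 8B backbone.

```python
import numpy as np


def participation_ratio(hidden: np.ndarray) -> float:
    """Effective dimensionality of a (n_tokens, d_model) activation matrix.

    Hypothetical stand-in for a geometry proxy; not the paper's actual score.
    """
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the covariance eigenvalues.
    eig = np.linalg.svd(centered, compute_uv=False) ** 2
    return float(eig.sum() ** 2 / (np.square(eig).sum() + 1e-12))


def select_interface_layer(layer_states, max_layer=None):
    """Score every layer and return (chosen_index, scores).

    Assumption: the lowest score is treated as the most diffusion-friendly.
    `max_layer` optionally restricts the search to shallow layers.
    """
    scores = [participation_ratio(h) for h in layer_states]
    candidates = scores if max_layer is None else scores[: max_layer + 1]
    return int(np.argmin(candidates)), scores


# Synthetic demo: four "layers" of 256 token states in 32 dimensions,
# with layer 1 deliberately confined to a 2-dimensional subspace.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(256, 32)) for _ in range(4)]
layers[1] = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 32))

best, scores = select_interface_layer(layers)
print("chosen interface layer:", best)
```

In a full pipeline under these assumptions, the layers up to the chosen index would be the prefix replaced by the diffusion bridge, while the layers above it and the original LM head would stay unchanged.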