Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement

Abstract

Continuous diffusion language models lag behind autoregressive transformers, in part because diffusion is applied in spaces that are poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and the original LM head. By reconstructing the hidden state at the selected layer rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol, and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison that matches the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.
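As a rough illustration of the pipeline outlined above (score layers, pick an interface, replace the lower prefix, keep the upper layers and LM head), the following sketch may help. It is not DiHAL's implementation. The abstract does not specify the geometry proxies, so the participation ratio used here is a stand-in chosen purely for illustration, as is the assumption that a lower value marks a more diffusion-friendly layer. The names `participation_ratio`, `select_interface_layer`, and `hybrid_forward` are hypothetical, and the synthetic arrays merely stand in for per-layer hidden states.

```python
import numpy as np


def participation_ratio(h):
    """Effective dimensionality of hidden states h with shape (n_tokens, d).

    Stand-in geometry proxy (an assumption, not the paper's score):
    PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues),
    computed on the covariance of the centered hidden states.
    """
    centered = h - h.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(h) - 1, 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(eig.sum() ** 2 / (np.square(eig).sum() + 1e-12))


def select_interface_layer(hidden_states, max_layer):
    """Score layers 1..max_layer and return (best_layer, scores).

    hidden_states maps a layer index to an (n_tokens, d) array.
    Illustrative assumption: the lowest proxy value is the best interface.
    """
    scores = {k: participation_ratio(hidden_states[k])
              for k in range(1, max_layer + 1)}
    best = min(scores, key=scores.get)
    return best, scores


def hybrid_forward(bridge, layers, lm_head, x, k):
    """Run the hybrid: the bridge replaces layers 1..k, the rest are kept.

    bridge  : callable producing the layer-k hidden state from the input
              (in DiHAL this role is played by the diffusion bridge)
    layers  : list of callables; layers[i] is original layer i + 1
    lm_head : the original output head, applied unchanged
    """
    h = bridge(x)
    for layer in layers[k:]:
        h = layer(h)
    return lm_head(h)


# Synthetic stand-ins for per-layer hidden states (not real model activations).
rng = np.random.default_rng(0)
n_tokens, d = 200, 8
hidden_states = {
    1: rng.normal(size=(n_tokens, d)),
    2: rng.normal(size=(n_tokens, 2)) @ rng.normal(size=(2, d)),  # rank-2 states
    3: rng.normal(size=(n_tokens, d)),
}
best, scores = select_interface_layer(hidden_states, max_layer=3)
```

In this toy setup the rank-2 layer receives the lowest score and is selected as the interface. With a real backbone, the per-layer hidden states would come from forward passes of the pretrained model, and the bridge would be a trained diffusion model rather than an arbitrary callable.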