Latent Flow Transformer

Abstract

Transformers, the standard implementation of large language models (LLMs), typically consist of tens to hundreds of discrete layers. While adding layers can improve performance, this approach has been criticized as inefficient, especially given the advantages of continuous layers demonstrated by diffusion and flow-based models for image generation. We propose the Latent Flow Transformer (LFT), which replaces a block of layers with a single transport operator trained via flow matching, offering significant compression while remaining compatible with the original architecture. We also introduce the Flow Walking (FW) algorithm to address the limitations of existing flow-based methods in preserving coupling. On the Pythia-410M model, LFT trained with flow matching compresses 6 of 24 layers and outperforms directly skipping 2 layers (KL divergence of LM logits of 0.407 vs. 0.529), demonstrating the feasibility of this design. When trained with FW, LFT further distills 12 layers into one with a KL divergence of 0.736, lower than that from skipping 3 layers (0.932), significantly narrowing the gap between autoregressive and flow-based generation paradigms.
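To make the two technical ingredients of the abstract concrete, the following minimal sketch illustrates (a) how a flow matching training pair might be built from the hidden states entering and leaving a block of layers, and (b) how a KL divergence between LM logits can be computed. It is not the paper's implementation. The straight-line interpolation path, the KL direction (reference model first), and all function names (`flow_matching_pair`, `euler_transport`, `kl_divergence`) are assumptions made for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) from raw logits (direction is an assumption)."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def flow_matching_pair(h_in, h_out, t):
    """Build one training pair for a straight-line path (assumed path choice).

    h_in  : hidden state entering the replaced block of layers
    h_out : hidden state leaving the replaced block of layers
    Returns the interpolated state x_t and the velocity target h_out - h_in.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(h_in, h_out)]
    v_target = [b - a for a, b in zip(h_in, h_out)]
    return x_t, v_target


def euler_transport(h_in, velocity_fn, steps=1):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(h_in)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    h_in, h_out = [1.0, 2.0], [3.0, 0.0]
    x_half, v = flow_matching_pair(h_in, h_out, t=0.5)
    # A perfectly learned velocity field on a straight path is constant,
    # so a single Euler step carries h_in onto h_out.
    transported = euler_transport(h_in, lambda x, t: v, steps=1)
    print(x_half, v, transported)
    print(kl_divergence([0.0, 0.0], [0.0, math.log(3.0)]))
```

In this toy setting the velocity is known exactly, so one integration step recovers the block output. In practice a learned network would stand in for `velocity_fn`, and the KL divergence would be averaged over token positions to compare the compressed model against the original.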