Latent Flow Transformer

Abstract

Transformers, the standard implementation for large language models (LLMs), typically consist of tens to hundreds of discrete layers. While more layers can lead to better performance, this approach has been criticized as inefficient, especially given the superiority of continuous layers demonstrated by diffusion and flow-based models for image generation. We propose the Latent Flow Transformer (LFT), which replaces a block of layers with a single learned transport operator trained via flow matching, offering significant compression while maintaining compatibility with the original architecture. Additionally, we address the limitations of existing flow-based methods in preserving coupling by introducing the Flow Walking (FW) algorithm. On the Pythia-410M model, LFT trained with flow matching compresses 6 of 24 layers and outperforms directly skipping 2 layers (KL divergence of LM logits of 0.407 vs. 0.529), demonstrating the feasibility of this design. When trained with FW, LFT further distills 12 layers into one while reducing the KL divergence to 0.736, surpassing the result of skipping 3 layers (0.932) and significantly narrowing the gap between autoregressive and flow-based generation paradigms.
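To make the two central ingredients concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the common straight-path form of flow matching, in which a velocity model is regressed onto the difference between the hidden state entering the replaced block and the hidden state leaving it, and it assumes the reported metric is KL(P_original || P_compressed) over next-token distributions. All function names (`interpolate`, `flow_matching_loss`, `euler_transport`, `kl_from_logits`) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math

def interpolate(h_start, h_end, t):
    """Point on the straight path between two hidden states at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(h_start, h_end)]

def target_velocity(h_start, h_end):
    """Velocity of the straight path (assumed form); it is constant in t."""
    return [b - a for a, b in zip(h_start, h_end)]

def flow_matching_loss(velocity_fn, h_start, h_end, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(h_start, h_end, t)
    pred = velocity_fn(x_t, t)
    target = target_velocity(h_start, h_end)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def euler_transport(velocity_fn, h_start, steps=1):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(h_start)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def kl_from_logits(ref_logits, test_logits):
    """KL(P_ref || P_test) for the distributions defined by two logit vectors."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Toy data (hypothetical): one hidden state before and after the replaced block.
h_in = [0.0, 1.0]
h_out = [2.0, -1.0]

# A velocity function that happens to be exact for this single pair.
def exact_velocity(x, t):
    return [2.0, -2.0]

loss = flow_matching_loss(exact_velocity, h_in, h_out, t=0.3)
h_pred = euler_transport(exact_velocity, h_in, steps=1)
kl_same = kl_from_logits([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
kl_diff = kl_from_logits([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In this toy setting the exact velocity yields zero loss and a single Euler step recovers `h_out`, which mirrors the idea of one transport operator standing in for several layers. In practice the velocity function would be a learned network, and the KL divergence would be averaged over token positions in an evaluation set.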