Latent Flow Transformer

Abstract

Transformers, the standard implementation for large language models (LLMs), typically consist of tens to hundreds of discrete layers. Although adding layers can improve performance, this approach has been criticized as inefficient, especially given the advantages that continuous layers have shown in diffusion-based and flow-based models for image generation. We propose the Latent Flow Transformer (LFT), which replaces a block of layers with a single learned transport operator trained via flow matching, offering substantial compression while remaining compatible with the original architecture. We also address the limitations of existing flow-based methods in preserving coupling by introducing the Flow Walking (FW) algorithm. On the Pythia-410M model, LFT trained with flow matching compresses 6 of 24 layers and outperforms directly skipping 2 layers (KL divergence of LM logits of 0.407 vs. 0.529), demonstrating the feasibility of this design. When trained with FW, LFT further distills 12 layers into one with a KL divergence of 0.736, lower than the 0.932 obtained by skipping 3 layers, significantly narrowing the gap between autoregressive and flow-based generation paradigms.
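
To make the two technical ingredients of the abstract concrete, the sketch below illustrates (a) a flow matching objective between the hidden state entering a block of layers and the hidden state leaving it, and (b) the KL divergence between two sets of LM logits. It is a minimal illustration under stated assumptions, not the authors' implementation: the straight-line interpolation path, the Euler integrator, the KL direction (reference model relative to compressed model), and all function names are hypothetical choices made for this example.

```python
import math

def interpolate(x0, x1, t):
    """Point on the straight line from x0 (block input) to x1 (block output).
    Assumption: a linear path, a common choice in flow matching."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted velocity at x_t and the
    straight-line target velocity x1 - x0."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t)
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def euler_transport(velocity_fn, x0, steps=1):
    """Move x0 from t=0 to t=1 with fixed-step Euler integration.
    With steps=1 the whole block is replaced by one operator evaluation."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def kl_divergence(reference_logits, compressed_logits):
    """KL(reference || compressed) over next-token distributions.
    Assumption: the direction of the KL is chosen here for illustration."""
    lp = log_softmax(reference_logits)
    lq = log_softmax(compressed_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
```

In this toy setting, a velocity function that always returns `x1 - x0` gives zero loss and transports `x0` exactly onto `x1`; in practice the velocity function would be a learned network and the hidden states would be tensors of shape (sequence length, hidden size).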