Latent Flow Transformer

Abstract

Transformers, the standard architecture for large language models (LLMs), typically consist of tens to hundreds of discrete layers. While more layers can improve performance, this approach has been challenged as far from efficient, especially given the superiority of continuous layers demonstrated by diffusion and flow-based models for image generation. We propose the Latent Flow Transformer (LFT), which replaces a block of layers with a single learned transport operator trained via flow matching, offering significant compression while maintaining compatibility with the original architecture. Additionally, we introduce the Flow Walking (FW) algorithm to address the limitations of existing flow-based methods in preserving coupling. On the Pythia-410M model, LFT trained with flow matching compresses 6 of 24 layers and outperforms directly skipping 2 layers (KL divergence of LM logits of 0.407 vs. 0.529), demonstrating the feasibility of this design. When trained with FW, LFT further distills 12 layers into one while reducing the KL divergence to 0.736, surpassing the 0.932 obtained by skipping 3 layers and significantly narrowing the gap between autoregressive and flow-based generation paradigms.
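
To make the idea of a transport operator trained via flow matching concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common straight-line (linear interpolation) form of flow matching, in which a latent entering the replaced block (`h_in`) is paired with the latent leaving it (`h_out`) and a velocity field is regressed onto their difference. The path choice, all function names, and the toy vectors are illustrative assumptions; the abstract does not specify them, and Flow Walking is not shown.

```python
# Minimal flow-matching sketch on plain Python lists (no external libraries).
# Assumption: straight-line paths between paired latents; names are hypothetical.

def interpolate(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for a straight path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)

def transport(velocity_fn, x0, num_steps=1):
    """Move x0 along the learned flow with forward Euler steps from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-ins for the latents entering and leaving the replaced block.
h_in = [0.5, -1.0, 2.0]
h_out = [1.5, 0.0, -1.0]

# An oracle velocity field for this single pair; a trained network would
# approximate it from (x_t, t) alone, across many pairs.
def oracle(x_t, t):
    return target_velocity(h_in, h_out)

loss = flow_matching_loss(oracle, h_in, h_out, t=0.3)
h_pred = transport(oracle, h_in, num_steps=4)
```

With the oracle field the loss is zero and the Euler integration lands on `h_out`. In this sketch `num_steps` controls how many evaluations of the velocity field replace the original block, which is where the compression would come from under these assumptions.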