Improving Recursive Transformers with Mixture of LoRAs
December 14, 2025
Authors: Mohammadmahdi Nouriborji, Morteza Rohanian, Omid Rohanian
cs.AI
Abstract
Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M–120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.
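To make the two mechanisms in the abstract concrete, a minimal sketch follows: LoRA experts inserted inside a shared FFN with a token-conditional router, plus a merge step that folds the mixture into a single adapter for inference. This is an illustrative reading under stated assumptions, not the paper's implementation; the class name MoLFeedForward, the plain GELU two-layer FFN, the softmax router over all experts, and the uniform-weight merging rule are all hypothetical choices made for the example.

```python
# Illustrative sketch of a Mixture-of-LoRAs (MoL) feed-forward layer.
# Assumptions (not from the paper): a two-layer GELU FFN backbone,
# soft routing over all experts, and merging by averaging LoRA deltas.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoLFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, num_experts=4, rank=8):
        super().__init__()
        # Shared (tied) FFN backbone, reused across recursive layers.
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        # Low-rank expert factors: expert e contributes A_e @ B_e as an
        # additive delta to the shared input projection (standard LoRA init: B = 0).
        self.lora_A = nn.Parameter(torch.randn(num_experts, d_model, rank) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, rank, d_ff))
        # Token-conditional router over the experts.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                                        # x: (B, S, d_model)
        gates = F.softmax(self.router(x), dim=-1)                # (B, S, E)
        # Naive dense mixture: compute every expert's low-rank delta, then gate.
        low = torch.einsum("bsd,edr->bser", x, self.lora_A)      # (B, S, E, r)
        delta = torch.einsum("bser,erf->bsef", low, self.lora_B) # (B, S, E, d_ff)
        delta = torch.einsum("bsef,bse->bsf", delta, gates)      # (B, S, d_ff)
        h = F.gelu(self.w_in(x) + delta)
        return self.w_out(h)

    @torch.no_grad()
    def merge_experts(self, mix_weights=None):
        """Collapse the expert mixture into a single adapter for deployment:
        average the expert deltas (or use supplied mixing weights) and fold
        the result into the shared input projection."""
        num_experts = self.lora_A.shape[0]
        if mix_weights is None:
            mix_weights = torch.full((num_experts,), 1.0 / num_experts,
                                     device=self.lora_A.device)
        merged = torch.einsum("e,edr,erf->df", mix_weights, self.lora_A, self.lora_B)
        self.w_in.weight += merged.T   # nn.Linear stores weight as (d_ff, d_model)
        self.lora_B.zero_()            # routed path becomes a no-op after merging
```

After merge_experts() the layer runs with one folded adapter and no per-token expert computation, which mirrors in simplified form the deployment-time compression the abstract describes.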