Improving Recursive Transformers with Mixture of LoRAs
December 14, 2025
Authors: Mohammadmahdi Nouriborji, Morteza Rohanian, Omid Rohanian
cs.AI
Abstract
Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M–120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.
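To make the two mechanisms in the abstract concrete, a minimal sketch follows: LoRA experts inserted inside a shared FFN with a token-conditional router, plus a merge step that folds the mixture into a single adapter for inference. This is an illustrative reading under stated assumptions, not the paper's implementation; the class name MoLFeedForward, the plain GELU two-layer FFN, the softmax router over all experts, and the uniform-weight merging rule are all hypothetical choices made for the example.

```python
# Illustrative sketch of a Mixture-of-LoRAs (MoL) feed-forward layer.
# Assumptions (not from the paper): a two-layer GELU FFN backbone,
# soft routing over all experts, and merging by averaging LoRA deltas.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoLFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, num_experts=4, rank=8):
        super().__init__()
        # Shared (tied) FFN backbone, reused across recursive layers.
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        # Low-rank expert factors: expert e contributes A_e @ B_e as an
        # additive delta to the shared input projection (standard LoRA init: B = 0).
        self.lora_A = nn.Parameter(torch.randn(num_experts, d_model, rank) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, rank, d_ff))
        # Token-conditional router over the experts.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                                        # x: (B, S, d_model)
        gates = F.softmax(self.router(x), dim=-1)                # (B, S, E)
        # Naive dense mixture: compute every expert's low-rank delta, then gate.
        low = torch.einsum("bsd,edr->bser", x, self.lora_A)      # (B, S, E, r)
        delta = torch.einsum("bser,erf->bsef", low, self.lora_B) # (B, S, E, d_ff)
        delta = torch.einsum("bsef,bse->bsf", delta, gates)      # (B, S, d_ff)
        h = F.gelu(self.w_in(x) + delta)
        return self.w_out(h)

    @torch.no_grad()
    def merge_experts(self, mix_weights=None):
        """Collapse the expert mixture into a single adapter for deployment:
        average the expert deltas (or use supplied mixing weights) and fold
        the result into the shared input projection."""
        num_experts = self.lora_A.shape[0]
        if mix_weights is None:
            mix_weights = torch.full((num_experts,), 1.0 / num_experts,
                                     device=self.lora_A.device)
        merged = torch.einsum("e,edr,erf->df", mix_weights, self.lora_A, self.lora_B)
        self.w_in.weight += merged.T   # nn.Linear stores weight as (d_ff, d_model)
        self.lora_B.zero_()            # routed path becomes a no-op after merging
```

After merge_experts() the layer runs with one folded adapter and no per-token expert computation, which mirrors in simplified form the deployment-time compression the abstract describes.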