Improving Recursive Transformers with Mixture of LoRAs

December 14, 2025
Authors: Mohammadmahdi Nouriborji, Morteza Rohanian, Omid Rohanian
cs.AI

Abstract

Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M–120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.
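
To make the mechanism concrete, here is a minimal PyTorch sketch of the idea the abstract describes: low-rank experts inserted inside a shared FFN, a token-conditional router deciding how the tied weights are modulated, and a merge step that folds the experts into a single adapter for inference. This is an illustration under our own assumptions, not the authors' implementation; the class and parameter names, the dimensions, and the gate-averaged merging rule are placeholders, and the paper's GeGLU activation is simplified to a plain GELU FFN.

```python
# Illustrative sketch only -- not the authors' code. Names, dimensions, and the
# merging rule (gate-averaged) are assumptions; GeGLU is simplified to GELU.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoLFeedForward(nn.Module):
    """A shared FFN whose input projection is modulated by token-routed LoRA experts."""

    def __init__(self, d_model=768, d_ff=3072, num_experts=4, rank=8):
        super().__init__()
        # Shared (tied) FFN weights, reused by every recursion of the block.
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        # Low-rank experts: A ~ N(0, 0.02), B = 0, so the update starts as a no-op.
        self.lora_A = nn.Parameter(torch.randn(num_experts, d_model, rank) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, rank, d_ff))
        # Token-conditional router over experts.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                                  # x: (batch, seq, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (batch, seq, num_experts)
        # Per-expert low-rank updates to the input projection, mixed by the gates.
        delta = torch.einsum("bsd,edr,erf->bsef", x, self.lora_A, self.lora_B)
        delta = torch.einsum("bse,bsef->bsf", gates, delta)
        h = F.gelu(self.w_in(x) + delta)                   # backbone weights stay tied
        return self.w_out(h)

    @torch.no_grad()
    def merge_experts(self, avg_gates):
        """Fold the experts into the shared projection for inference.

        avg_gates: (num_experts,) routing weights averaged over a calibration set,
        a stand-in for whatever merging statistic the paper actually uses.
        """
        delta_w = torch.einsum("e,edr,erf->df", avg_gates, self.lora_A, self.lora_B)
        self.w_in.weight.add_(delta_w.T)                   # (d_ff, d_model) layout
```

In this sketch, the zero-initialised lora_B matrices make the modulation a no-op at the start of training, so the block begins as an ordinary tied FFN, and merge_experts leaves a single dense projection with no routing overhead at inference.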