SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
July 10, 2024
Authors: Haiwen Diao, Bo Wan, Xu Jia, Yunzhi Zhuge, Ying Zhang, Huchuan Lu, Long Chen
cs.AI
Abstract
Parameter-efficient transfer learning (PETL) has emerged as a flourishing
research field for adapting large pre-trained models to downstream tasks,
greatly reducing trainable parameters while grappling with memory challenges
during fine-tuning. To address this, memory-efficient transfer learning (METL)
methods avoid backpropagating gradients through the large backbone. However, they compromise
by exclusively relying on frozen intermediate outputs and limiting the
exhaustive exploration of prior knowledge from pre-trained models. Moreover,
the dependency and redundancy between cross-layer features are frequently
overlooked, thereby submerging more discriminative representations and causing
an inherent performance gap compared to conventional PETL methods. Hence, we
propose SHERL, an innovative METL strategy for resource-limited scenarios that
decouples the entire adaptation into two successive and complementary processes.
In the early route, intermediate outputs are consolidated via an
anti-redundancy operation, enhancing their compatibility for subsequent
interactions; then, in the late route, a minimal number of late pre-trained
layers alleviates the peak memory overhead and regulates these fairly
flexible features into more adaptive and powerful representations for
new domains. Extensive ablations on vision-and-language and language-only tasks
show that SHERL combines the strengths of both parameter- and memory-efficient
techniques, performing on par with or better than them across diverse
architectures with lower memory during fine-tuning. Our code is publicly available at:
https://github.com/Paranioar/SHERL.
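As a rough illustration of the two-route design described in the abstract, the sketch below merges frozen intermediate outputs with an anti-redundancy weighting and then refines the result with a small number of trainable late pre-trained layers. The class names (AntiRedundancyMerge, SHERLLikeAdapter), the cosine-similarity redundancy score, and the layer-splitting convention are illustrative assumptions, not the authors' implementation; see the linked repository for the official code.

```python
# Minimal PyTorch sketch of the two-route idea; an illustration, not the official SHERL code.
import torch
import torch.nn as nn


class AntiRedundancyMerge(nn.Module):
    """Consolidate frozen intermediate outputs, down-weighting layers whose
    summaries are highly similar (redundant) to the others."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # lightweight trainable projection

    def forward(self, feats):  # feats: list of [B, N, D] frozen layer outputs
        x = torch.stack(feats, dim=0)                        # [L, B, N, D]
        token = nn.functional.normalize(x.mean(dim=2), -1)   # [L, B, D] layer summaries
        sim = torch.einsum('lbd,mbd->lmb', token, token)     # pairwise layer similarity
        redundancy = sim.mean(dim=1)                         # [L, B] avg similarity to others
        weights = torch.softmax(-redundancy, dim=0)          # less redundant -> larger weight
        merged = (weights[..., None, None] * x).sum(dim=0)   # [B, N, D]
        return self.proj(merged)


class SHERLLikeAdapter(nn.Module):
    """Early route: frozen backbone layers run without gradients.
    Late route: only the last few pre-trained layers (and the head) are trained."""
    def __init__(self, backbone_layers, head, dim, num_late=1):
        super().__init__()
        self.early = nn.ModuleList(backbone_layers[:-num_late]).requires_grad_(False)
        self.late = nn.ModuleList(backbone_layers[-num_late:])  # trainable late layers
        self.merge = AntiRedundancyMerge(dim)
        self.head = head

    def forward(self, x):
        feats = []
        with torch.no_grad():            # no gradients through the large early backbone
            for layer in self.early:
                x = layer(x)
                feats.append(x)
        h = self.merge(feats)            # consolidate intermediate outputs (early route)
        for layer in self.late:          # minimal late layers keep the memory peak low
            h = layer(h)
        return self.head(h)
```

Under these assumptions, `backbone_layers` would be the ordered blocks of a pre-trained transformer (e.g., ViT or BERT layers), so only the merge module, the last `num_late` blocks, and the task head receive gradients, which is what keeps both trainable parameters and peak activation memory small.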