Knowledge Composition using Task Vectors with Learned Anisotropic Scaling
July 3, 2024
Authors: Frederic Z. Zhang, Paul Albert, Cristian Rodriguez-Opazo, Anton van den Hengel, Ehsan Abbasnejad
cs.AI
Abstract
Pre-trained models produce strong generic representations that can be adapted
via fine-tuning. The learned weight difference relative to the pre-trained
model, known as a task vector, characterises the direction and stride of
fine-tuning. The significance of task vectors is such that simple arithmetic
operations on them can be used to combine diverse representations from
different domains. This paper builds on these properties of task vectors and
aims to answer (1) whether components of task vectors, particularly parameter
blocks, exhibit similar characteristics, and (2) how such blocks can be used to
enhance knowledge composition and transfer. To this end, we introduce aTLAS, an
algorithm that linearly combines parameter blocks with different learned
coefficients, resulting in anisotropic scaling at the task vector level. We
show that such linear combinations explicitly exploit the low intrinsic
dimensionality of pre-trained models, with only a few coefficients being the
learnable parameters. Furthermore, composition of parameter blocks leverages
the already learned representations, thereby reducing the dependency on large
amounts of data. We demonstrate the effectiveness of our method in task
arithmetic, few-shot recognition and test-time adaptation, with supervised or
unsupervised objectives. In particular, we show that (1) learned anisotropic
scaling allows task vectors to be more disentangled, causing less interference
in composition; (2) task vector composition excels with scarce or no labeled
data and is less prone to domain shift, thus leading to better
generalisability; (3) mixing the most informative parameter blocks across
different task vectors prior to training can reduce the memory footprint and
improve the flexibility of knowledge transfer. Moreover, we show the potential
of aTLAS as a PEFT method, particularly with less data, and demonstrate its
scalability.
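
A minimal sketch of the per-block anisotropic composition described in the abstract, assuming a PyTorch setting; the function and variable names (e.g. `compose_task_vectors`) are illustrative, not the authors' released implementation:

```python
import torch

def compose_task_vectors(pretrained_state, task_vectors, coefficients):
    """Compose model weights from a pre-trained state and scaled task vectors.

    pretrained_state: dict mapping parameter-block names to tensors (theta_0).
    task_vectors: list of dicts, each mapping the same block names to the
        fine-tuned weight difference (theta_t - theta_0) for one task.
    coefficients: tensor of shape (num_tasks, num_blocks); one learned scaling
        coefficient per (task vector, parameter block) pair, which is what makes
        the scaling anisotropic rather than a single scalar per task vector.
    """
    block_names = list(pretrained_state.keys())
    composed = {}
    for b, name in enumerate(block_names):
        update = sum(coefficients[t, b] * tv[name]
                     for t, tv in enumerate(task_vectors))
        composed[name] = pretrained_state[name] + update
    return composed

# The coefficients are the only learnable parameters: a handful of scalars per
# task vector, exploiting the low intrinsic dimensionality of the pre-trained
# model. Sizes below are illustrative.
num_tasks, num_blocks = 3, 12
coefficients = torch.zeros(num_tasks, num_blocks, requires_grad=True)
optimizer = torch.optim.Adam([coefficients], lr=1e-2)
```

In this reading, optimising only the coefficient matrix (rather than the model weights) is what allows the composition to work with scarce or unlabelled data, whether the objective is supervised or unsupervised.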