Transformer Layers as Painters
July 12, 2024
Authors: Qi Sun, Marc Pickett, Aakash Kumar Nain, Llion Jones
cs.AI
Abstract
Despite their nearly universal adoption for large language models, the
internal workings of transformers are not well understood. We aim to better
understand the impact of removing or reorganizing information throughout the
layers of a pretrained transformer. Such an understanding could both yield
better usage of existing models and enable architectural improvements
to produce new variants. We present a series of empirical studies on frozen
models that show that the lower and final layers of pretrained transformers
differ from middle layers, but that middle layers have a surprising amount of
uniformity. We further show that some classes of problems have robustness to
skipping layers, running the layers in an order different from how they were
trained, or running the layers in parallel. Our observations suggest that even
frozen pretrained models may gracefully trade accuracy for latency by skipping
layers or running layers in parallel.
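The abstract describes layer-level interventions on a frozen pretrained transformer: skipping layers, running them in a different order, and running them in parallel. The sketch below illustrates the skip and reorder variants only, assuming a GPT-2-style model loaded through Hugging Face transformers; the model choice ("gpt2"), the layer ranges, and the helper forward_with_layer_order are illustrative assumptions, not the paper's exact models or evaluation setup.

```python
# Minimal sketch (not the paper's setup): run a frozen GPT-2 with its middle
# blocks skipped or reordered, by iterating over the transformer blocks manually.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # frozen: inference only


def forward_with_layer_order(model, input_ids, layer_order):
    """Apply the GPT-2 blocks in `layer_order` (a list of block indices);
    omitting indices skips those layers entirely."""
    transformer = model.transformer
    positions = torch.arange(input_ids.size(1), device=input_ids.device)
    hidden = transformer.wte(input_ids) + transformer.wpe(positions)
    hidden = transformer.drop(hidden)
    for idx in layer_order:
        # Each block returns a tuple; element 0 is the hidden states.
        hidden = transformer.h[idx](hidden)[0]
    hidden = transformer.ln_f(hidden)
    return model.lm_head(hidden)  # logits over the vocabulary


input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
n = len(model.transformer.h)
full = list(range(n))
# Illustrative layer ranges: skip blocks 4-7, or reverse their order.
skip_middle = [i for i in full if not (4 <= i < 8)]
reversed_middle = full[:4] + list(reversed(full[4:8])) + full[8:]

with torch.no_grad():
    for name, order in [("full", full), ("skip", skip_middle),
                        ("reversed", reversed_middle)]:
        logits = forward_with_layer_order(model, input_ids, order)
        next_token = tokenizer.decode(logits[0, -1].argmax().item())
        print(name, "->", repr(next_token))
```

Comparing the greedy next-token prediction (or task accuracy) across the three orders gives a rough sense of how robust a frozen model is to these interventions; the parallel-execution variant discussed in the abstract is not shown here.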