Scavenging Hyena: Distilling Transformers into Long Convolution Models
January 31, 2024
Authors: Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang
cs.AI
Abstract
The rapid evolution of Large Language Models (LLMs), epitomized by
architectures like GPT-4, has reshaped the landscape of natural language
processing. This paper introduces a pioneering approach to address the
efficiency concerns associated with LLM pre-training, proposing the use of
knowledge distillation for cross-architecture transfer. Leveraging insights
from the efficient Hyena mechanism, our method replaces attention heads in
transformer models with Hyena operators, offering a cost-effective alternative
to traditional pre-training while addressing the challenge of processing long
contextual information that is inherent to quadratic attention mechanisms. Unlike
conventional compression-focused methods, our technique not only enhances
inference speed but also surpasses pre-training in terms of both accuracy and
efficiency. In the era of evolving LLMs, our work contributes to the pursuit of
sustainable AI solutions, striking a balance between computational power and
environmental impact.
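
To make the idea concrete, the sketch below illustrates the two ingredients the abstract describes: a Hyena-style long-convolution token mixer that could stand in for a transformer's attention sub-layer, and a standard soft-target distillation loss for transferring a teacher's behaviour to the modified student. This is a minimal, hypothetical illustration, not the authors' implementation; all names (`LongConvMixer`, `distillation_loss`) are assumptions, and the real Hyena operator uses implicitly parameterized filters and additional projections beyond this simplified explicit kernel.

```python
# Minimal sketch (assumed, illustrative only): a gated long-convolution mixer
# in the spirit of Hyena, plus a Hinton-style distillation loss. Not the
# authors' code; module and function names are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LongConvMixer(nn.Module):
    """Hyena-style token mixer: gated long convolution applied via FFT."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)            # value and gate
        self.kernel = nn.Parameter(torch.randn(d_model, max_len) * 0.02)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        v, gate = self.in_proj(x).chunk(2, dim=-1)

        # Causal long convolution in the frequency domain: O(L log L) per
        # channel, instead of the O(L^2) cost of full self-attention.
        n = 2 * seq_len
        k_f = torch.fft.rfft(self.kernel[:, :seq_len], n=n)       # (d, n/2+1)
        v_f = torch.fft.rfft(v.transpose(1, 2), n=n)              # (b, d, n/2+1)
        y = torch.fft.irfft(v_f * k_f, n=n)[..., :seq_len]        # (b, d, L)
        y = y.transpose(1, 2)                                     # (b, L, d)

        return self.out_proj(torch.sigmoid(gate) * y)


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Soft-target KL loss for transferring the teacher's predictions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    # Shape check: the mixer is a drop-in replacement for an attention
    # sub-layer that maps (batch, seq_len, d_model) -> (batch, seq_len, d_model).
    mixer = LongConvMixer(d_model=64, max_len=128)
    x = torch.randn(2, 128, 64)
    print(mixer(x).shape)  # torch.Size([2, 128, 64])
```

In a cross-architecture distillation setup of this kind, the attention module of each student block would be swapped for the convolutional mixer and the student trained against the frozen teacher's logits with a loss like the one above, optionally mixed with the usual language-modeling objective.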