Scavenging Hyena: Distilling Transformers into Long Convolution Models
January 31, 2024
Authors: Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang
cs.AI
Abstract
The rapid evolution of Large Language Models (LLMs), epitomized by
architectures like GPT-4, has reshaped the landscape of natural language
processing. This paper introduces a pioneering approach to address the
efficiency concerns associated with LLM pre-training, proposing the use of
knowledge distillation for cross-architecture transfer. Leveraging insights
from the efficient Hyena mechanism, our method replaces attention heads in
transformer models with Hyena operators, offering a cost-effective alternative
to traditional pre-training while addressing the challenge of processing long
contextual information that is inherent to quadratic attention mechanisms. Unlike
conventional compression-focused methods, our technique not only enhances
inference speed but also surpasses pre-training in terms of both accuracy and
efficiency. In the era of evolving LLMs, our work contributes to the pursuit of
sustainable AI solutions, striking a balance between computational power and
environmental impact.
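
To make the idea concrete, the sketch below illustrates the two ingredients the abstract describes: a Hyena-style long-convolution token mixer that could stand in for a transformer's attention sub-layer, and a standard soft-target distillation loss for transferring a teacher's behaviour to the modified student. This is a minimal, hypothetical illustration, not the authors' implementation; all names (`LongConvMixer`, `distillation_loss`) are assumptions, and the real Hyena operator uses implicitly parameterized filters and additional projections beyond this simplified explicit kernel.

```python
# Minimal sketch (assumed, illustrative only): a gated long-convolution mixer
# in the spirit of Hyena, plus a Hinton-style distillation loss. Not the
# authors' code; module and function names are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LongConvMixer(nn.Module):
    """Hyena-style token mixer: gated long convolution applied via FFT."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)            # value and gate
        self.kernel = nn.Parameter(torch.randn(d_model, max_len) * 0.02)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        v, gate = self.in_proj(x).chunk(2, dim=-1)

        # Causal long convolution in the frequency domain: O(L log L) per
        # channel, instead of the O(L^2) cost of full self-attention.
        n = 2 * seq_len
        k_f = torch.fft.rfft(self.kernel[:, :seq_len], n=n)       # (d, n/2+1)
        v_f = torch.fft.rfft(v.transpose(1, 2), n=n)              # (b, d, n/2+1)
        y = torch.fft.irfft(v_f * k_f, n=n)[..., :seq_len]        # (b, d, L)
        y = y.transpose(1, 2)                                     # (b, L, d)

        return self.out_proj(torch.sigmoid(gate) * y)


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Soft-target KL loss for transferring the teacher's predictions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    # Shape check: the mixer is a drop-in replacement for an attention
    # sub-layer that maps (batch, seq_len, d_model) -> (batch, seq_len, d_model).
    mixer = LongConvMixer(d_model=64, max_len=128)
    x = torch.randn(2, 128, 64)
    print(mixer(x).shape)  # torch.Size([2, 128, 64])
```

In a cross-architecture distillation setup of this kind, the attention module of each student block would be swapped for the convolutional mixer and the student trained against the frozen teacher's logits with a loss like the one above, optionally mixed with the usual language-modeling objective.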