
LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models

May 28, 2024
作者: Anthony Sarah, Sharath Nittur Sridhar, Maciej Szankin, Sairam Sundaresan
cs.AI

Abstract

The abilities of modern large language models (LLMs) in solving natural language processing, complex reasoning, sentiment analysis and other tasks have been extraordinary, which has prompted their extensive adoption. Unfortunately, these abilities come with very high memory and computational costs, which preclude the use of LLMs on most hardware platforms. To mitigate this, we propose an effective method of finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot NAS. In particular, we fine-tune LLaMA2-7B only once and then apply genetic algorithm-based search to find smaller, less computationally complex network architectures. We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex. More specifically, we demonstrate a 1.5x reduction in model size and a 1.3x speedup in throughput for certain tasks with a negligible drop in accuracy. In addition to finding smaller, higher-performing network architectures, our method does so more effectively and efficiently than certain pruning or sparsification techniques. Finally, we demonstrate how quantization is complementary to our method and that the size and complexity of the networks we find can be further decreased using quantization. We believe that our work provides a way to automatically create LLMs which can be used on less expensive and more readily available hardware platforms.
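The genetic-algorithm search the abstract mentions can be sketched in miniature. The snippet below is a hedged illustration, not the paper's implementation: the search space (number of transformer layers kept, FFN width fraction), the choice sets, and the toy `fitness` function are all hypothetical stand-ins for the real procedure, which evaluates sub-networks of the fine-tuned LLaMA2-7B super-network on benchmark tasks.

```python
import random

# Hypothetical search space: each candidate sub-network is encoded as
# (number of transformer layers kept, fraction of FFN width kept).
LAYER_CHOICES = list(range(16, 33))            # assumed range: 16..32 layers
WIDTH_CHOICES = [0.5, 0.625, 0.75, 0.875, 1.0]

def fitness(cand):
    # Toy stand-in for benchmark accuracy vs. model-size trade-off:
    # a saturating "accuracy" term minus a linear size penalty.
    layers, width = cand
    size = layers * width
    return size ** 0.5 - 0.15 * size

def mutate(cand):
    # Resample one gene (layer count or width fraction) at random.
    layers, width = cand
    if random.random() < 0.5:
        layers = random.choice(LAYER_CHOICES)
    else:
        width = random.choice(WIDTH_CHOICES)
    return (layers, width)

def crossover(a, b):
    # Single-point crossover: layers from one parent, width from the other.
    return (a[0], b[1])

def search(pop_size=20, generations=10, seed=0):
    random.seed(seed)
    pop = [(random.choice(LAYER_CHOICES), random.choice(WIDTH_CHOICES))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]          # keep the fittest half
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)
```

In the actual method, evaluating `fitness` requires no retraining per candidate: the super-network is fine-tuned once, and each candidate is scored by activating the corresponding subset of weights, which is what makes the one-shot search tractable.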

