LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models

Abstract

Modern large language models (LLMs) have shown extraordinary ability in natural language processing, complex reasoning, sentiment analysis, and other tasks, which has prompted their extensive adoption. Unfortunately, these abilities come with very high memory and computational costs, which preclude the use of LLMs on most hardware platforms. To mitigate this, we propose an effective method of finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot neural architecture search (NAS). In particular, we fine-tune LLaMA2-7B only once and then apply a genetic algorithm-based search to find smaller, less computationally complex network architectures. We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex. More specifically, we demonstrate a 1.5x reduction in model size and a 1.3x speedup in throughput for certain tasks with a negligible drop in accuracy. In addition to finding smaller, higher-performing network architectures, our method does so more effectively and efficiently than certain pruning or sparsification techniques. Finally, we demonstrate that quantization is complementary to our method and that the size and complexity of the networks we find can be further decreased using quantization. We believe that our work provides a way to automatically create LLMs that can be used on less expensive and more readily available hardware platforms.
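
To make the term "Pareto-optimal" concrete, the following is a minimal sketch of how a set of candidate sub-networks could be filtered down to a Pareto front over model size and accuracy. It is not the authors' implementation: the `Candidate` type, the `dominates` and `pareto_front` helpers, the candidate names, and all numbers are hypothetical and chosen only for illustration. The genetic search itself is not shown.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str        # hypothetical label for a sub-network
    size_gb: float   # model size, lower is better
    accuracy: float  # task accuracy, higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.size_gb <= b.size_gb and a.accuracy >= b.accuracy
    strictly_better = a.size_gb < b.size_gb or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only candidates that no other candidate dominates, sorted by size."""
    front = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
    return sorted(front, key=lambda c: c.size_gb)


# Made-up numbers for illustration only; they are not results from the paper.
candidates = [
    Candidate("full", 12.0, 0.80),
    Candidate("sub_a", 9.0, 0.80),
    Candidate("sub_b", 8.0, 0.78),
    Candidate("sub_c", 10.0, 0.75),
    Candidate("sub_d", 7.0, 0.70),
]

front_names = [c.name for c in pareto_front(candidates)]
print(front_names)
```

In this toy data, `full` drops out because `sub_a` matches its accuracy at a smaller size, which mirrors the abstract's claim that the full network can be unnecessarily large for some tasks. A search procedure would repeatedly propose new candidates and keep updating such a front.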
