LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models

Abstract

Modern large language models (LLMs) have shown extraordinary abilities in natural language processing, complex reasoning, sentiment analysis, and other tasks, which has prompted their extensive adoption. Unfortunately, these abilities come with very high memory and computational costs, which preclude the use of LLMs on most hardware platforms. To mitigate this, we propose an effective method for finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot neural architecture search (NAS). In particular, we fine-tune LLaMA2-7B only once and then apply a genetic algorithm-based search to find smaller, less computationally complex network architectures. We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex. More specifically, we demonstrate a 1.5x reduction in model size and a 1.3x speedup in throughput for certain tasks with a negligible drop in accuracy. In addition to finding smaller, higher-performing network architectures, our method does so more effectively and efficiently than certain pruning or sparsification techniques. Finally, we demonstrate that quantization is complementary to our method: the size and complexity of the networks we find can be further reduced using quantization. We believe that our work provides a way to automatically create LLMs that can be used on less expensive and more readily available hardware platforms.
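The search step described above (a genetic algorithm looking for Pareto-optimal trade-offs between accuracy and model size among sub-networks of a once-fine-tuned super-network) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the search space (`NUM_LAYERS`, `INTERMEDIATE`), the `proxy_accuracy` function, and the simple elitist selection rule are all hypothetical. In the actual method, each candidate would be a sub-network extracted from the fine-tuned LLaMA2-7B and evaluated on the target task. The parameter count uses LLaMA2-7B's hidden size of 4096 and counts only the attention and MLP projection weights.

```python
import random
from dataclasses import dataclass

# Hypothetical search space (an assumption for illustration only).
NUM_LAYERS = [24, 26, 28, 30, 32]
INTERMEDIATE = [5504, 8256, 11008]
HIDDEN = 4096  # LLaMA2-7B hidden size


@dataclass(frozen=True)
class Arch:
    layers: int
    intermediate: int


def param_count(a: Arch) -> int:
    # Attention (q, k, v, o projections) plus gated MLP (gate, up, down) per layer;
    # embeddings and norm weights are ignored.
    per_layer = 4 * HIDDEN * HIDDEN + 3 * HIDDEN * a.intermediate
    return a.layers * per_layer


def proxy_accuracy(a: Arch) -> float:
    # Toy stand-in (hypothetical) for evaluating a sub-network on a benchmark task.
    return 0.5 + 0.01 * a.layers + 0.000002 * a.intermediate


def objectives(a: Arch) -> tuple[float, int]:
    return proxy_accuracy(a), param_count(a)


def dominates(x: tuple[float, int], y: tuple[float, int]) -> bool:
    # Maximize accuracy, minimize parameters; x must be strictly better somewhere.
    return x[0] >= y[0] and x[1] <= y[1] and x != y


def pareto_front(archs: list[Arch]) -> list[Arch]:
    objs = {a: objectives(a) for a in archs}
    return [a for a in archs if not any(dominates(objs[b], objs[a]) for b in archs)]


def crossover(p: Arch, q: Arch, rng: random.Random) -> Arch:
    # Uniform crossover: each gene is inherited from one of the two parents.
    return Arch(rng.choice([p.layers, q.layers]), rng.choice([p.intermediate, q.intermediate]))


def mutate(a: Arch, rng: random.Random, rate: float = 0.3) -> Arch:
    # Each gene is resampled from the search space with probability `rate`.
    layers = rng.choice(NUM_LAYERS) if rng.random() < rate else a.layers
    inter = rng.choice(INTERMEDIATE) if rng.random() < rate else a.intermediate
    return Arch(layers, inter)


def search(pop_size: int = 10, generations: int = 20, seed: int = 0) -> list[Arch]:
    rng = random.Random(seed)
    population = {Arch(rng.choice(NUM_LAYERS), rng.choice(INTERMEDIATE)) for _ in range(pop_size)}
    for _ in range(generations):
        parents = list(population)
        children = {
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size)
        }
        pool = list(population | children)
        # Elitist selection: keep every non-dominated candidate, then fill by accuracy.
        front = pareto_front(pool)
        rest = sorted(set(pool) - set(front), key=proxy_accuracy, reverse=True)
        population = set((front + rest)[: max(pop_size, len(front))])
    # Return the non-dominated architectures, smallest first.
    return sorted(pareto_front(list(population)), key=param_count)


if __name__ == "__main__":
    for arch in search():
        print(arch, f"{param_count(arch) / 1e9:.2f}B params", f"proxy acc {proxy_accuracy(arch):.3f}")
```

Running the script prints the approximate Pareto front found by the search, ordered from smallest to largest. For reference, `param_count(Arch(32, 11008))` is 6,476,005,376, which matches LLaMA2-7B's full size once its embedding, output-head, and norm weights are added. With a real evaluator, each `proxy_accuracy` call becomes a benchmark run on an extracted sub-network, which is where most of the search cost lies; fine-tuning only once keeps that cost from growing with the number of candidates.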
