

SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

May 13, 2024
作者: Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li, Yongning Sheng, Joshua Brot, Denis Sokolov, Apurv Vivek, Calvin Leung, Arjun Sabnis, Jiayu Bai, Tuowen Zhao, Mark Gottscho, David Jackson, Mark Luttrell, Manish K. Shah, Edison Chen, Kaizhao Liang, Swayambhoo Jain, Urmish Thakker, Dawei Huang, Sumti Jairath, Kevin J. Brown, Kunle Olukotun
cs.AI

Abstract

Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2x to 13x on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19x, speeds up model switching time by 15x to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a DGX A100.
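
For intuition, the sketch below illustrates, in Python, the two ideas the abstract combines: a router that dispatches each prompt to a single expert in a Composition of Experts, and a tiered weight store in which every expert's weights live in a large, slow memory while only recently used experts occupy a small, fast tier. This is a minimal illustration under assumed names (TieredWeightStore, route, generate) and a toy keyword router; it is not the SambaNova software stack or the Samba-CoE router model described in the paper.

```python
# Minimal sketch of Composition-of-Experts (CoE) inference with a two-tier
# weight store (small fast tier vs. large slow tier). All names here are
# hypothetical illustrations, not the SambaNova stack; expert selection is
# reduced to a keyword router instead of a learned router model.

from collections import OrderedDict


class TieredWeightStore:
    """Keeps every expert's weights in a large, slow tier and caches the
    most recently used experts in a small, fast tier (LRU eviction)."""

    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity        # how many experts fit in the fast tier
        self.fast_tier = OrderedDict()            # expert name -> weights (LRU order)
        self.slow_tier = {}                       # assume all experts fit here

    def register(self, name, weights):
        self.slow_tier[name] = weights

    def fetch(self, name):
        """Return the expert's weights, promoting them to the fast tier.
        A miss models the slow-to-fast copy that dominates switch time."""
        if name in self.fast_tier:
            self.fast_tier.move_to_end(name)      # refresh LRU order
            return self.fast_tier[name]
        weights = self.slow_tier[name]            # "copy" from the slow tier
        self.fast_tier[name] = weights
        if len(self.fast_tier) > self.fast_capacity:
            self.fast_tier.popitem(last=False)    # evict least recently used expert
        return weights


def route(prompt):
    """Toy router: a real CoE uses a learned router to pick the expert."""
    text = prompt.lower()
    if any(tok in text for tok in ("def ", "class ", "bug")):
        return "code_expert"
    if any(tok in text for tok in ("prove", "integral", "sum")):
        return "math_expert"
    return "general_expert"


def generate(prompt, store):
    expert = route(prompt)
    weights = store.fetch(expert)                 # may trigger a tier-to-tier copy
    return f"[{expert} ({weights['params_b']}B params) answers: {prompt!r}]"


if __name__ == "__main__":
    store = TieredWeightStore(fast_capacity=2)
    for name, size_b in [("code_expert", 7), ("math_expert", 7), ("general_expert", 13)]:
        store.register(name, {"params_b": size_b})
    for p in ["Fix this bug in my parser", "Compute the integral of x^2", "Plan a trip"]:
        print(generate(p, store))
```

In this toy model, the LRU cache stands in for the traffic between the slow and fast memory tiers that dominates model-switching time: when the working set of experts fits in the fast tier, switching is cheap; when it does not, each switch pays a copy from the slow tier, which is the kind of cost a capacity-tiered memory system is intended to amortize.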
