SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
May 13, 2024
Authors: Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li, Yongning Sheng, Joshua Brot, Denis Sokolov, Apurv Vivek, Calvin Leung, Arjun Sabnis, Jiayu Bai, Tuowen Zhao, Mark Gottscho, David Jackson, Mark Luttrell, Manish K. Shah, Edison Chen, Kaizhao Liang, Swayambhoo Jain, Urmish Thakker, Dawei Huang, Sumti Jairath, Kevin J. Brown, Kunle Olukotun
cs.AI
Abstract
Monolithic large language models (LLMs) like GPT-4 have paved the way for
modern generative AI applications. Training, serving, and maintaining
monolithic LLMs at scale, however, remains prohibitively expensive and
challenging. The disproportionate increase in compute-to-memory ratio of modern
AI accelerators has created a memory wall, necessitating new methods to deploy
AI. Composition of Experts (CoE) is an alternative modular approach that lowers
the cost and complexity of training and serving. However, this approach
presents two key challenges when using conventional hardware: (1) without fused
operations, smaller models have lower operational intensity, which makes high
utilization more challenging to achieve; and (2) hosting a large number of
models can be either prohibitively expensive or slow when dynamically switching
between them.
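
To make challenge (1) concrete, here is a minimal back-of-the-envelope sketch (the shapes, batch size, and fp16 element size are illustrative assumptions, not figures from the paper) of how operational intensity, i.e. FLOPs per byte moved, collapses for a small, unfused matrix multiply such as a single-token decode step:

```python
def gemm_operational_intensity(m, k, n, bytes_per_elem=2):
    """Rough FLOPs-per-byte for one (m x k) @ (k x n) matrix multiply,
    assuming inputs and output each cross off-chip memory exactly once
    (i.e. the op is not fused with its neighbors)."""
    flops = 2 * m * k * n                               # multiply-accumulates
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Hypothetical fp16 decode step (one token) vs. a large batched GEMM:
print(gemm_operational_intensity(1, 4096, 4096))        # ~1 FLOP/byte: memory bound
print(gemm_operational_intensity(4096, 4096, 4096))     # ~1365 FLOPs/byte: compute bound
```

At batch size 1 the intensity sits near 1 FLOP per byte, far below what is needed to keep a modern accelerator busy, which is why fusing the surrounding operations on-chip matters for smaller expert models.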
In this paper, we describe how combining CoE, streaming dataflow, and a
three-tier memory system scales the AI memory wall. We describe Samba-CoE, a
CoE system with 150 experts and a trillion total parameters. We deploy
Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a
commercial dataflow accelerator architecture that has been co-designed for
enterprise inference and training applications. The chip introduces a new
three-tier memory system with on-chip distributed SRAM, on-package HBM, and
off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out
over multiple sockets. We demonstrate speedups ranging from 2x to 13x on
various benchmarks running on eight RDU sockets compared with an unfused
baseline. We show that for CoE inference deployments, the 8-socket RDU Node
reduces machine footprint by up to 19x, speeds up model switching time by 15x
to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a
DGX A100.
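
As a purely illustrative sketch of the CoE serving pattern described above (the class, router, and tier sizes are hypothetical stand-ins, not SambaNova's software stack), a router selects one expert per request, a small fast-memory tier holds the recently used experts, and the remaining experts are assumed to wait in a larger capacity tier:

```python
from collections import OrderedDict

class CoEServer:
    """Toy Composition-of-Experts server: a router picks one expert model
    per prompt; an LRU cache stands in for the fast-memory tier (e.g. HBM),
    with evicted experts assumed to live in a capacity tier (e.g. DDR)."""

    def __init__(self, experts, router, fast_tier_capacity=4):
        self.experts = experts          # name -> callable standing in for a model
        self.router = router            # prompt -> expert name
        self.capacity = fast_tier_capacity
        self.fast_tier = OrderedDict()  # experts currently "resident" in fast memory

    def _stage(self, name):
        """Model switching: evict the least-recently-used expert if the fast
        tier is full, then bring the requested expert in from the capacity tier."""
        if name in self.fast_tier:
            self.fast_tier.move_to_end(name)
            return
        if len(self.fast_tier) >= self.capacity:
            self.fast_tier.popitem(last=False)
        self.fast_tier[name] = self.experts[name]

    def generate(self, prompt):
        name = self.router(prompt)
        self._stage(name)               # switching cost is paid here, per request
        return self.fast_tier[name](prompt)

# Usage with stand-in "experts" and a trivial keyword router:
experts = {
    "code": lambda p: f"[code expert] {p}",
    "math": lambda p: f"[math expert] {p}",
    "chat": lambda p: f"[chat expert] {p}",
}
router = lambda p: "code" if "def " in p else ("math" if any(c.isdigit() for c in p) else "chat")
server = CoEServer(experts, router, fast_tier_capacity=2)
print(server.generate("what is 2 + 2?"))   # routed to the math expert
```

In this toy model the cost of hosting many experts is concentrated in the staging step rather than in every layer of every model, which is the deployment concern the paper's DDR-backed tier and model-switching speedups address.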