PUMA: Secure Inference of LLaMA-7B in Five Minutes
July 24, 2023
Authors: Ye Dong, Wen-jie Lu, Yancheng Zheng, Haoqi Wu, Derun Zhao, Jin Tan, Zhicong Huang, Cheng Hong, Tao Wei, Wenguang Chen
cs.AI
Abstract
With ChatGPT as a representative, many companies have begun to provide services based on large Transformer models. However, using such a service inevitably leaks users' prompts to the model provider. Previous studies have explored secure inference for Transformer models using secure multiparty computation (MPC), where both the model parameters and the clients' prompts are kept secret. Despite this, existing frameworks remain limited in terms of model performance, efficiency, and deployment. To address these limitations, we propose the PUMA framework to enable fast and secure Transformer model inference. Our framework designs high-quality approximations for expensive functions such as GeLU and Softmax, which significantly reduce the cost of secure inference while preserving model performance. Additionally, we design secure Embedding and LayerNorm procedures that faithfully implement the desired functionality without undermining the Transformer architecture. PUMA is about 2x faster than the state-of-the-art MPC framework MPCFORMER (ICLR 2023) and achieves accuracy similar to plaintext models without fine-tuning, which previous works failed to achieve.
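
To make the approximation claim concrete, here is a minimal NumPy sketch of the piecewise-polynomial idea: outside a central interval, GeLU is essentially 0 on the left and x on the right, and low-degree polynomials cover the middle, so an MPC engine only needs secret comparisons, multiplications, and additions. The breakpoints, polynomial degrees, and helper names below are illustrative choices for this sketch; PUMA's exact segmentation and coefficients are given in the paper.

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # Exact GeLU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

# Fit low-degree polynomials on the two inner segments (breakpoints and
# degrees are illustrative, not PUMA's published coefficients).
xs_lo = np.linspace(-4.0, -1.95, 1000)
xs_mid = np.linspace(-1.95, 3.0, 1000)
poly_lo = np.polyfit(xs_lo, gelu_exact(xs_lo), deg=3)
poly_mid = np.polyfit(xs_mid, gelu_exact(xs_mid), deg=6)

def gelu_approx(x):
    # Piecewise evaluation. Under MPC, each np.where becomes a secret
    # comparison followed by an oblivious select; the polynomials need
    # only secret multiplications and additions.
    return np.where(
        x < -4.0,
        0.0,
        np.where(
            x < -1.95,
            np.polyval(poly_lo, x),
            np.where(x <= 3.0, np.polyval(poly_mid, x), x),
        ),
    )

xs = np.linspace(-6.0, 6.0, 100_001)
print("max abs error:", np.abs(gelu_approx(xs) - gelu_exact(xs)).max())
```

The point of this design is that comparisons and low-degree polynomials are far cheaper in secret-shared fixed-point arithmetic than evaluating tanh or erf directly, while the fitting error stays small enough to preserve model accuracy.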
Moreover, PUMA can evaluate LLaMA-7B in around 5 minutes to generate one token. To the best of our knowledge, this is the first time a model of this parameter size has been evaluated under MPC. PUMA has been open-sourced in the GitHub repository of SecretFlow-SPU.
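
As an illustration of why the Embedding step can be made both faithful and MPC-friendly, the sketch below uses the standard one-hot formulation: the lookup E[token_id] is rewritten as onehot(token_id) @ E, a plain matrix product that a secret-sharing protocol can evaluate without revealing the token id (the client's secret) or the table E (the model owner's secret). This is a plaintext sketch with made-up sizes and a hypothetical helper name, not the PUMA/SPU implementation.

```python
import numpy as np

# Illustrative sizes; LLaMA-7B itself uses vocab_size=32000, d_model=4096.
vocab_size, d_model = 1000, 64
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model))  # model owner's secret table

def embed_via_onehot(token_ids, table):
    # onehot(token_ids) @ table equals table[token_ids], but expressed as
    # a matrix product, which MPC protocols evaluate natively on secret
    # shares, so neither party learns the other's input.
    onehot = np.zeros((len(token_ids), table.shape[0]), dtype=table.dtype)
    onehot[np.arange(len(token_ids)), token_ids] = 1.0  # client's secret
    return onehot @ table

token_ids = np.array([42, 7, 999])
assert np.allclose(embed_via_onehot(token_ids, E), E[token_ids])
```

The trade-off is a vocab-sized matrix product per token instead of a cheap index; obliviousness is what that extra work buys.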