PUMA: Secure Inference of LLaMA-7B in Five Minutes
July 24, 2023
Authors: Ye Dong, Wen-jie Lu, Yancheng Zheng, Haoqi Wu, Derun Zhao, Jin Tan, Zhicong Huang, Cheng Hong, Tao Wei, Wenguang Chen
cs.AI
Abstract
With ChatGPT as a representative example, many companies have begun to provide
services based on large Transformer models. However, using such a service
inevitably leaks users' prompts to the model provider. Prior work has explored
secure inference for Transformer models using secure multi-party computation
(MPC), where both the model parameters and the clients' prompts are kept
secret. Despite this, existing frameworks remain limited in terms of model
performance, efficiency, and deployment. To address these limitations, we
propose PUMA, a framework for fast and secure Transformer model inference.
Our framework designs high-quality approximations for expensive functions,
such as GeLU and Softmax, which significantly reduce the cost of secure
inference while preserving model performance. Additionally, we design secure
Embedding and LayerNorm procedures that faithfully implement the desired
functionality without undermining the Transformer architecture. PUMA is about
2x faster than the state-of-the-art MPC framework MPCFormer (ICLR 2023) and
achieves accuracy similar to plaintext models without fine-tuning (which
previous works failed to achieve).
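
To make the GeLU point concrete, here is a minimal NumPy sketch of a piecewise-polynomial GeLU approximation in the spirit the abstract describes. The segment boundaries, polynomial degree, and least-squares fit below are illustrative assumptions, not PUMA's published segments or coefficients.

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    # Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

# Fit a low-degree polynomial only on the middle range, where GeLU is
# genuinely nonlinear. Far to the left GeLU is ~0 and far to the right it
# is ~x, so the tails are handled by trivial pieces.
LO, HI, DEG = -4.0, 3.0, 6  # illustrative choices, not PUMA's parameters
xs = np.linspace(LO, HI, 4096)
coeffs = np.polyfit(xs, gelu(xs), DEG)

def gelu_piecewise(x):
    x = np.asarray(x, dtype=np.float64)
    mid = np.polyval(coeffs, np.clip(x, LO, HI))
    return np.where(x < LO, 0.0, np.where(x > HI, x, mid))

# The polynomial needs only additions and multiplications, which are the
# cheap operations under MPC; evaluating erf or tanh obliviously would be
# far more expensive.
print("max |error| on [LO, HI]:", np.max(np.abs(gelu_piecewise(xs) - gelu(xs))))
```

The design intuition is that accuracy is only needed where the function is nonlinear; spending the polynomial budget there is what keeps the approximation both cheap under MPC and faithful enough to leave model accuracy intact.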
Moreover, PUMA can evaluate LLaMA-7B in around 5 minutes to generate one
token. To the best of our knowledge, this is the first time that a model of
this parameter size has been evaluated under MPC. PUMA has been open-sourced
in the GitHub repository of SecretFlow-SPU.
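
The Softmax approximation can be illustrated in the same hedged spirit. After subtracting the row maximum, every input is non-positive, so exp can be replaced by repeated squaring of (1 + x/2^t), a well-known MPC-friendly identity that uses only multiplications. The iteration count and clipping bound below are illustrative assumptions, not PUMA's published parameters.

```python
import numpy as np

def exp_neg_approx(x, t=6):
    # Approximates exp(x) for x <= 0 via (1 + x/2**t) ** (2**t), computed
    # with t squarings. Very negative inputs are clipped, since exp(x) is
    # effectively 0 for softmax purposes there anyway.
    x = np.clip(x, -16.0, 0.0)
    y = 1.0 + x / (2.0 ** t)
    for _ in range(t):
        y = y * y  # t squarings replace a full exp evaluation
    return y

def softmax_approx(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # shift so z <= 0
    e = exp_neg_approx(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
print(softmax_approx(logits))  # close to the exact softmax of these logits
```

The max-subtraction step doubles as numerical stabilization and as the guarantee that the approximation only ever sees the non-positive domain on which it is accurate.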