PUMA: Secure Inference of LLaMA-7B in Five Minutes

Abstract

With ChatGPT as a representative example, many companies have begun to provide services based on large Transformer models. However, using such a service inevitably leaks users' prompts to the model provider. Previous work has studied secure inference for Transformer models using secure multiparty computation (MPC), where both the model parameters and the clients' prompts are kept secret. Despite this, these frameworks remain limited in terms of model performance, efficiency, and deployment. To address these limitations, we propose the PUMA framework to enable fast and secure Transformer model inference. Our framework designs high-quality approximations for expensive functions, such as GeLU and Softmax, which significantly reduce the cost of secure inference while preserving model performance. Additionally, we design secure Embedding and LayerNorm procedures that faithfully implement the desired functionality without undermining the Transformer architecture. PUMA is about 2x faster than the state-of-the-art MPC framework MPCFORMER (ICLR 2023) and achieves accuracy similar to that of plaintext models without fine-tuning (which previous works failed to achieve). In addition, PUMA can evaluate LLaMA-7B, taking around 5 minutes to generate one token. To the best of our knowledge, this is the first time that a model of this parameter size has been evaluated under MPC. PUMA has been open-sourced in the GitHub repository of SecretFlow-SPU.
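The abstract says PUMA replaces expensive functions such as GeLU with approximations, but it does not give the construction. As a rough illustration of the general idea only, the sketch below approximates GeLU with a piecewise function: zero on the left tail, the identity on the right tail, and a fitted polynomial in between. Such a function needs only additions, multiplications, and comparisons, which tend to be the cheaper operations under MPC. The cut points, the degree, and the least-squares fit are assumptions chosen for illustration, not PUMA's actual approximation, and the code runs in plaintext NumPy with no secret sharing.

```python
import math
import numpy as np

def gelu(x):
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    x = np.asarray(x, dtype=np.float64)
    return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

# Hypothetical cut points and polynomial degree (illustrative, not from the paper).
LOW, HIGH, DEGREE = -4.0, 4.0, 6

# Least-squares fit of a single polynomial on the middle segment [LOW, HIGH].
_grid = np.linspace(LOW, HIGH, 2001)
_coeffs = np.polyfit(_grid, gelu(_grid), DEGREE)

def gelu_approx(x):
    """Piecewise approximation: 0 below LOW, x above HIGH, polynomial in between."""
    x = np.asarray(x, dtype=np.float64)
    mid = np.polyval(_coeffs, x)
    return np.where(x < LOW, 0.0, np.where(x > HIGH, x, mid))

# Compare against the exact function on a wider range than the fitted segment.
xs = np.linspace(-8.0, 8.0, 4001)
max_err = float(np.max(np.abs(gelu_approx(xs) - gelu(xs))))
print(f"max abs error on [-8, 8]: {max_err:.4f}")
```

Running the script prints the worst-case deviation from exact GeLU for these assumed settings. Using more segments or a higher degree would trade extra multiplications and comparisons for a smaller error.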
