PUMA: Secure Inference of LLaMA-7B in Five Minutes

Abstract

With ChatGPT as a representative example, many companies have begun to provide services based on large Transformer models. However, using such a service inevitably leaks users' prompts to the model provider. Previous work has studied secure inference for Transformer models using secure multiparty computation (MPC), where model parameters and clients' prompts are kept secret. Despite this, these frameworks are still limited in terms of model performance, efficiency, and deployment. To address these limitations, we propose the PUMA framework to enable fast and secure Transformer model inference. Our framework designs high-quality approximations for expensive functions, such as GeLU and Softmax, which significantly reduce the cost of secure inference while preserving model performance. Additionally, we design secure Embedding and LayerNorm procedures that faithfully implement the desired functionality without undermining the Transformer architecture. PUMA is about 2x faster than the state-of-the-art MPC framework MPCFORMER (ICLR 2023) and achieves accuracy similar to plaintext models without fine-tuning (which previous works failed to achieve). In addition, PUMA can evaluate LLaMA-7B under MPC, taking around 5 minutes to generate 1 token. To the best of our knowledge, this is the first time that a model of this parameter size has been evaluated under MPC. PUMA has been open-sourced in the GitHub repository of SecretFlow-SPU.
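To make concrete the idea that inputs can stay secret while still being computed on, the following is a minimal sketch of additive secret sharing over a 64-bit ring. It is not PUMA's protocol or the SecretFlow-SPU API: the function names, the ring size, and the choice of three parties are assumptions made purely for illustration.

```python
import secrets

# Illustrative assumption: shares live in the ring of integers modulo 2^64.
MODULUS = 2**64


def share(value, parties=3):
    """Split an integer into additive shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recover the secret by summing all shares mod MODULUS."""
    return sum(shares) % MODULUS


def add_shared(a, b):
    """Add two shared secrets; each party adds its own shares locally."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


def scale_shared(a, k):
    """Multiply a shared secret by a public constant k, locally per party."""
    return [(x * k) % MODULUS for x in a]


# Hypothetical values: a secret client input and a secret model bias.
prompt_feature = share(42)
model_bias = share(100)

# Compute 3 * prompt_feature + model_bias without any party seeing the inputs.
result = add_shared(scale_shared(prompt_feature, 3), model_bias)
print(reconstruct(result))  # 226
```

Addition and multiplication by a public constant need no communication between parties. Multiplying two secret values, and evaluating nonlinear functions such as GeLU and Softmax, requires interactive protocols, which is where most of the cost of secure Transformer inference arises.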
