AERO: Softmax-Only LLMs for Efficient Private Inference

Abstract

The pervasiveness of proprietary language models has raised privacy concerns for users' sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods incur prohibitively high communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis to understand the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and by reducing FLOP counts. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs, tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to a 4.23× reduction in communication and a 1.94× reduction in latency. We validate the effectiveness of AERO by benchmarking it against the state-of-the-art.
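To make the idea of regularizing attention entropy concrete, here is a minimal sketch in plain Python. The abstract does not specify the form of the regularizer, so everything below is an assumption for illustration: the function name `entropy_penalty`, the `target_fraction` parameter, and the squared-gap penalty are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_penalty(score_rows, target_fraction=0.5):
    """Hypothetical penalty (an assumption, not the paper's formula):
    mean squared gap between each row's attention entropy and a target
    set as a fraction of the maximum possible entropy, log(n)."""
    gaps = []
    for row in score_rows:
        h = entropy(softmax(row))
        target = target_fraction * math.log(len(row))
        gaps.append((h - target) ** 2)
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    uniform = [0.0, 0.0, 0.0, 0.0]   # attention spread evenly over 4 tokens
    peaked = [100.0, 0.0, 0.0, 0.0]  # attention collapsed onto one token
    print("uniform entropy:", entropy(softmax(uniform)))
    print("peaked entropy:", entropy(softmax(peaked)))
    print("penalty:", entropy_penalty([uniform, peaked]))
```

In a training loop, a term like this would typically be added to the language-modeling loss with a small weight, so that attention rows that are either fully uniform or fully collapsed are nudged toward an intermediate entropy. Whether AERO uses a target of this kind is not stated in the abstract.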
