
AERO: Softmax-Only LLMs for Efficient Private Inference

October 16, 2024
作者: Nandan Kumar Jha, Brandon Reagen
cs.AI

Abstract

The pervasiveness of proprietary language models has raised privacy concerns for users' sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods face prohibitively high communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis to understand the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and reducing FLOPs counts. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to a 4.23× reduction in communication and a 1.94× reduction in latency. We validate the effectiveness of AERO by benchmarking it against the state-of-the-art.
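To make the two core ideas concrete, the sketch below shows a decoder block whose only nonlinearity is the attention Softmax (no LayerNorm, no GELU-based FFN), together with an attention-entropy penalty. This is an illustrative NumPy sketch based on the abstract's description, not the paper's exact formulation; the function names and the form of the regularizer are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_only_block(x, Wq, Wk, Wv, Wo):
    """One causal decoder block where Softmax is the only nonlinearity:
    no LayerNorm and no GELU feed-forward network (hypothetical sketch)."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(d)
    causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal_mask, -1e9, scores)
    attn = softmax(scores, axis=-1)   # the sole nonlinear operation
    return x + attn @ v @ Wo, attn    # residual connection, no norm

def attention_entropy_penalty(attn, eps=1e-9):
    """Mean Shannon entropy of the attention rows. Penalizing or
    constraining this quantity is one plausible reading of the paper's
    entropy regularization (illustrative assumption)."""
    return float(-(attn * np.log(attn + eps)).sum(axis=-1).mean())
```

In a training loop, the entropy term would be added to the language-modeling loss with some weight, steering attention distributions away from degenerate (collapsed) entropy; the exact objective and weighting are the paper's contribution and are not reproduced here.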
