WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference

Abstract

The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. Recent approaches such as Mixture-of-Experts (MoE) leverage selective activation but require specialized training, whereas training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise $\ell_2$-norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal approximation error bounds with theoretical guarantees tighter than those of existing techniques. Empirically, WINA outperforms state-of-the-art methods (e.g., TEAL) by up to 2.94% in average performance at the same sparsity levels across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference and set a robust baseline for efficient inference. The source code is available at https://github.com/microsoft/wina.
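To make the selection criterion concrete, the following is a minimal NumPy sketch of one plausible reading of the abstract: each input neuron is scored by the product of its hidden state magnitude and the $\ell_2$-norm of the corresponding weight column, and only the top-k neurons are kept. The function names (`magnitude_mask`, `weight_informed_mask`, `sparse_matvec`), the top-k rule, and the toy diagonal weight matrix are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np


def magnitude_mask(x, k):
    """Baseline (assumed): keep the k entries of x with the largest |x_i|."""
    keep = np.argsort(np.abs(x))[-k:]
    mask = np.zeros(x.shape, dtype=bool)
    mask[keep] = True
    return mask


def weight_informed_mask(x, W, k):
    """Sketch of a weight-informed rule (assumed): score_i = |x_i| * ||W[:, i]||_2."""
    col_norms = np.linalg.norm(W, axis=0)  # l2-norm of each column of W
    scores = np.abs(x) * col_norms
    keep = np.argsort(scores)[-k:]
    mask = np.zeros(x.shape, dtype=bool)
    mask[keep] = True
    return mask


def sparse_matvec(W, x, mask):
    """Compute W @ x using only the columns and inputs selected by mask."""
    return W[:, mask] @ x[mask]


if __name__ == "__main__":
    # Toy example: the input with the largest magnitude feeds a tiny column.
    x = np.array([1.0, 2.0, 0.5])
    W = np.diag([1.0, 0.01, 10.0])
    dense = W @ x

    for name, mask in [
        ("magnitude only", magnitude_mask(x, 1)),
        ("weight informed", weight_informed_mask(x, W, 1)),
    ]:
        err = np.linalg.norm(dense - sparse_matvec(W, x, mask))
        print(f"{name}: kept={np.flatnonzero(mask).tolist()} error={err:.4f}")
```

In this toy case the magnitude-only rule keeps the neuron whose weight column contributes almost nothing to the output, while the weight-informed score keeps the neuron that dominates it, which illustrates why ignoring the weights can inflate the approximation error.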
