Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
July 15, 2024
Authors: Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei
cs.AI
Abstract
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains at inference. This is achieved by applying top-K sparsification to the activations and a straight-through estimator during training. The key results of this work are: (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) we present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training from scratch, continued training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). In particular, the synergy of BitNet b1.58 and Q-Sparse (optionally equipped with MoE) provides a cornerstone and a clear path toward revolutionizing the efficiency, including cost and energy consumption, of future LLMs.
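The abstract names the core mechanism as top-K sparsification of the activations trained with a straight-through estimator (STE). The following is a minimal sketch of that idea, assuming PyTorch; the function name topk_sparsify_ste, the per-token top-K over the hidden dimension, and the magnitude-based selection are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): top-K activation sparsification
# trained with a straight-through estimator (STE), assuming PyTorch.
import torch


def topk_sparsify_ste(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude entries along the last dimension, zero the rest.

    Forward pass: apply the hard top-K mask.
    Backward pass: straight-through estimator, i.e. gradients flow as if the
    masking were the identity, so the non-differentiable top-K can be trained.
    """
    # Indices of the k largest-magnitude activations per token (last dim).
    _, idx = torch.topk(x.abs(), k, dim=-1)
    # Binary mask with ones at the selected positions.
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    y_hard = x * mask
    # STE trick: forward value is y_hard, gradient w.r.t. x is the identity.
    return x + (y_hard - x).detach()


if __name__ == "__main__":
    x = torch.randn(2, 8, requires_grad=True)
    y = topk_sparsify_ste(x, k=3)
    y.sum().backward()
    print(y)       # at most 3 non-zero activations per row
    print(x.grad)  # all ones: gradient passed straight through the mask
```

The expression x + (y_hard - x).detach() makes the forward output equal to the sparsified activations while the backward pass treats the top-K mask as the identity, which is the usual way a straight-through estimator lets a hard selection operation be trained end to end.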