Addition is All You Need for Energy-efficient Language Models
October 1, 2024
Authors: Hongyin Luo, Wei Sun
cs.AI
Abstract
Large neural networks spend most computation on floating point tensor
multiplications. In this work, we find that a floating point multiplier can be
approximated by one integer adder with high precision. We propose the
linear-complexity multiplication L-Mul algorithm that approximates floating
point number multiplication with integer addition operations. Compared to
8-bit floating point multiplication, the new algorithm achieves higher
precision while consuming significantly fewer bit-level computation
resources. Since multiplying floating point
numbers requires substantially higher energy compared to integer addition
operations, applying the L-Mul operation in tensor processing hardware can
potentially reduce the energy cost of element-wise floating point tensor
multiplications by 95% and the energy cost of dot products by 80%. We
calculated the
theoretical error expectation of L-Mul, and evaluated the algorithm on a wide
range of textual, visual, and symbolic tasks, including natural language
understanding, structural reasoning, mathematics, and commonsense question
answering. Our numerical analysis experiments agree with the theoretical error
estimation, which indicates that L-Mul with a 4-bit mantissa achieves precision
comparable to float8_e4m3 multiplication, and L-Mul with a 3-bit mantissa
outperforms float8_e5m2. Evaluation results on popular benchmarks show that
directly applying L-Mul to the attention mechanism is almost lossless. We
further show that replacing all floating point multiplications with 3-bit
mantissa L-Mul in a transformer model achieves precision equivalent to using
float8_e4m3 as the accumulation precision in both fine-tuning and inference.
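The abstract does not spell out the L-Mul formula itself. As a rough, self-contained illustration of the core idea of replacing mantissa multiplication with addition, the Python sketch below decomposes two floats, adds their exponents and truncated mantissas, and substitutes a constant correction term for the mantissa product. The function name l_mul, the offset rule, and the truncation step are assumptions made for this illustration, not the paper's reference implementation.

```python
import math


def l_mul(x: float, y: float, mantissa_bits: int = 4) -> float:
    """Approximate x * y in the spirit of L-Mul: add exponents and mantissas
    instead of multiplying mantissas.

    Writing |x| = (1 + xm) * 2**xe and |y| = (1 + ym) * 2**ye, the exact product
    is (1 + xm + ym + xm * ym) * 2**(xe + ye). This sketch replaces the xm * ym
    term with a constant 2**(-offset), so only additions remain.
    """
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = -1.0 if (x < 0) != (y < 0) else 1.0

    # Offset rule as a function of the mantissa width m (an assumption of this
    # sketch; the paper defines the exact l(m) used in its experiments).
    if mantissa_bits <= 3:
        offset = mantissa_bits
    elif mantissa_bits == 4:
        offset = 3
    else:
        offset = 4

    # Decompose into mantissa fraction and exponent: |x| = (1 + xm) * 2**xe.
    fx, xe = math.frexp(abs(x))      # fx in [0.5, 1.0)
    fy, ye = math.frexp(abs(y))
    xm, xe = 2.0 * fx - 1.0, xe - 1  # xm in [0.0, 1.0)
    ym, ye = 2.0 * fy - 1.0, ye - 1

    # Simulate a low-precision mantissa by truncating to `mantissa_bits` bits.
    scale = 1 << mantissa_bits
    xm = math.floor(xm * scale) / scale
    ym = math.floor(ym * scale) / scale

    # Mantissa product replaced by addition plus a constant correction term.
    mantissa = 1.0 + xm + ym + 2.0 ** (-offset)
    return sign * mantissa * 2.0 ** (xe + ye)


if __name__ == "__main__":
    for a, b in [(1.5, 2.25), (3.1, -0.7), (0.125, 8.0)]:
        print(f"{a} * {b} = {a * b:.6f}, l_mul approx {l_mul(a, b):.6f}")
```

On real hardware the decomposition and additions would operate directly on the integer bit patterns of the operands, which is what allows a floating point multiplier to be approximated by a single integer adder, as the abstract describes.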