Addition is All You Need for Energy-efficient Language Models
October 1, 2024
Authors: Hongyin Luo, Wei Sun
cs.AI
Abstract
Large neural networks spend most computation on floating point tensor
multiplications. In this work, we find that a floating point multiplier can be
approximated by one integer adder with high precision. We propose the
linear-complexity multiplication L-Mul algorithm that approximates floating
point number multiplication with integer addition operations. Compared to
8-bit floating point multiplication, the new algorithm achieves higher
precision while consuming significantly fewer bit-level computation
resources. Since multiplying floating point
numbers requires substantially higher energy compared to integer addition
operations, applying the L-Mul operation in tensor processing hardware can
potentially reduce the energy cost of element-wise floating point tensor
multiplications by 95% and the energy cost of dot products by 80%. We
calculated the
theoretical error expectation of L-Mul, and evaluated the algorithm on a wide
range of textual, visual, and symbolic tasks, including natural language
understanding, structural reasoning, mathematics, and commonsense question
answering. Our numerical analysis experiments agree with the theoretical error
estimation, which indicates that L-Mul with a 4-bit mantissa achieves precision
comparable to float8_e4m3 multiplication, and L-Mul with a 3-bit mantissa
outperforms float8_e5m2. Evaluation results on popular benchmarks show that
directly applying L-Mul to the attention mechanism is almost lossless. We
further show that replacing all floating point multiplications with 3-bit
mantissa L-Mul in a transformer model achieves precision equivalent to using
float8_e4m3 as the accumulation precision in both fine-tuning and inference.
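The abstract does not spell out the L-Mul formula itself. As a rough, self-contained illustration of the core idea of replacing mantissa multiplication with addition, the Python sketch below decomposes two floats, adds their exponents and truncated mantissas, and substitutes a constant correction term for the mantissa product. The function name l_mul, the offset rule, and the truncation step are assumptions made for this illustration, not the paper's reference implementation.

```python
import math


def l_mul(x: float, y: float, mantissa_bits: int = 4) -> float:
    """Approximate x * y in the spirit of L-Mul: add exponents and mantissas
    instead of multiplying mantissas.

    Writing |x| = (1 + xm) * 2**xe and |y| = (1 + ym) * 2**ye, the exact product
    is (1 + xm + ym + xm * ym) * 2**(xe + ye). This sketch replaces the xm * ym
    term with a constant 2**(-offset), so only additions remain.
    """
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = -1.0 if (x < 0) != (y < 0) else 1.0

    # Offset rule as a function of the mantissa width m (an assumption of this
    # sketch; the paper defines the exact l(m) used in its experiments).
    if mantissa_bits <= 3:
        offset = mantissa_bits
    elif mantissa_bits == 4:
        offset = 3
    else:
        offset = 4

    # Decompose into mantissa fraction and exponent: |x| = (1 + xm) * 2**xe.
    fx, xe = math.frexp(abs(x))      # fx in [0.5, 1.0)
    fy, ye = math.frexp(abs(y))
    xm, xe = 2.0 * fx - 1.0, xe - 1  # xm in [0.0, 1.0)
    ym, ye = 2.0 * fy - 1.0, ye - 1

    # Simulate a low-precision mantissa by truncating to `mantissa_bits` bits.
    scale = 1 << mantissa_bits
    xm = math.floor(xm * scale) / scale
    ym = math.floor(ym * scale) / scale

    # Mantissa product replaced by addition plus a constant correction term.
    mantissa = 1.0 + xm + ym + 2.0 ** (-offset)
    return sign * mantissa * 2.0 ** (xe + ye)


if __name__ == "__main__":
    for a, b in [(1.5, 2.25), (3.1, -0.7), (0.125, 8.0)]:
        print(f"{a} * {b} = {a * b:.6f}, l_mul approx {l_mul(a, b):.6f}")
```

On real hardware the decomposition and additions would operate directly on the integer bit patterns of the operands, which is what allows a floating point multiplier to be approximated by a single integer adder, as the abstract describes.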