

The Nature of Mathematical Modeling and Probabilistic Optimization Engineering in Generative AI

October 24, 2024
Author: Fulu Li
cs.AI

Abstract

In this paper, we give an in-depth analysis of the mathematical problem formulations and the probabilistic optimization explorations for some of the key components of the Transformer model [33] in the field of generative AI. We explore and discuss potential further enhancements to current state-of-the-art methods for some key underlying technologies of generative AI models from an algorithmic and probabilistic optimization perspective. In particular, we present an optimal solution for sub-word encoding (SWE) based on initial settings similar to those of the byte-pair encoding (BPE) algorithm in [9], with objectives similar to those of the WordPiece approach in [28, 31], namely maximizing the likelihood of the training data. We also present a cross-entropy optimization method for optimizing the hyperparameters of the word2vec model [17]. In addition, we propose a factored combination of rotary positional encoding (RoPE) [32] and attention with linear biases (ALiBi) [23] with a harmonic series. We also present a probabilistic FlashAttention [6, 7] (PrFlashAttention) method with a probability distribution over block distances in the matrix to decide which block is likely to participate in a given round of attention computation, while maintaining the lower-triangular shape of the tensor for autoregressive language models by re-shaping the tensors. Finally, we present staircase adaptive quantization (SAQ) of the key-value (KV) cache for multi-query attention (MQA), based on the framework presented in [16], to achieve gradual quantization degradation while maintaining reasonable model quality and cost savings.
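
To make the first contribution more concrete, the sketch below illustrates the WordPiece-style likelihood criterion that the abstract contrasts with BPE: starting from BPE-like character-level tokens, each merge step picks the adjacent pair whose merge most increases the (approximate) likelihood of the training data, rather than the most frequent pair. This is a minimal, hypothetical sketch of that general criterion under a crude normalization assumption; the function names and toy corpus are invented for illustration, and it is not the paper's optimal SWE algorithm.

```python
# Minimal sketch (assumed names, toy data): a WordPiece-style merge step on
# BPE-like character-level initial tokens. Not the paper's SWE method.
import math
from collections import Counter

def pair_scores(corpus_tokens):
    """Score each adjacent token pair by the likelihood gain of merging it.

    corpus_tokens: list of token sequences, e.g. [["l","o","w"], ["l","o","w","e","r"]].
    Score = log p(ab) - log p(a) - log p(b), a rough WordPiece-style criterion;
    plain BPE would instead rank pairs by raw count.
    """
    unigrams = Counter(tok for seq in corpus_tokens for tok in seq)
    pairs = Counter(
        (seq[i], seq[i + 1]) for seq in corpus_tokens for i in range(len(seq) - 1)
    )
    total = sum(unigrams.values())  # crude shared normalization for the sketch
    return {
        (a, b): math.log(c_ab / total)
                - math.log(unigrams[a] / total)
                - math.log(unigrams[b] / total)
        for (a, b), c_ab in pairs.items()
    }

def merge_best_pair(corpus_tokens):
    """Apply one greedy merge step: merge the highest-scoring adjacent pair."""
    scores = pair_scores(corpus_tokens)
    if not scores:
        return corpus_tokens, None
    best = max(scores, key=scores.get)
    merged = []
    for seq in corpus_tokens:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])  # merge the pair into one subword
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, best

if __name__ == "__main__":
    corpus = [list("lower"), list("lowest"), list("newer"), list("wider")]
    corpus, pair = merge_best_pair(corpus)
    print("merged pair:", pair)
    print(corpus)
```

In practice the merge loop would be repeated until a target vocabulary size is reached; the point of the sketch is only the scoring rule, i.e. choosing merges by likelihood gain rather than frequency, which is the objective the SWE formulation in the paper builds on.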
