The Nature of Mathematical Modeling and Probabilistic Optimization Engineering in Generative AI

Abstract

In this paper, we give an in-depth analysis of the mathematical problem formulations and probabilistic optimization explorations for some of the key components of the Transformer model [33] in the field of generative AI. From an algorithmic and probabilistic optimization perspective, we explore and discuss potential further enhancements to current state-of-the-art methods for several key underlying technologies of generative AI models. In particular, we present an optimal solution for sub-word encoding (SWE) that maximizes the likelihood of the training data, based on initial settings similar to those of the byte-pair encoding (BPE) algorithm in [9] and objectives similar to those of the WordPiece approach in [28, 31]. We also present a cross-entropy optimization method for optimizing the hyperparameters of the word2vec model [17]. In addition, we propose a factored combination of rotary positional encoding (RoPE) [32] and attention with linear biases (ALiBi) [23] with a harmonic series. We also present a probabilistic FlashAttention [6, 7] (PrFlashAttention) method, which uses a probability distribution over block distances in the matrix to decide which blocks are likely to participate in a given round of attention computation, while maintaining the lower-triangular shape of the tensor for autoregressive language models by reshaping the tensors. Finally, we present staircase adaptive quantization (SAQ) of the key-value (KV) cache for multi-query attention (MQA), based on the framework presented in [16], to obtain gradual quantization degradation while achieving reasonable model quality and cost savings.
