MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies
May 26, 2023
Authors: Shiyue Zhang, Shijie Wu, Ozan Irsoy, Steven Lu, Mohit Bansal, Mark Dredze, David Rosenberg
cs.AI
Abstract
Autoregressive language models are trained by minimizing the cross-entropy of
the model distribution Q relative to the data distribution P -- that is,
minimizing the forward cross-entropy, which is equivalent to maximum likelihood
estimation (MLE). We have observed that models trained in this way may
"over-generalize", in the sense that they produce non-human-like text.
Moreover, we believe that reverse cross-entropy, i.e., the cross-entropy of P
relative to Q, is a better reflection of how a human would evaluate text
generated by a model. Hence, we propose learning with MixCE, an objective that
mixes the forward and reverse cross-entropies. We evaluate models trained with
this objective on synthetic data settings (where P is known) and real data, and
show that the resulting models yield better generated text without complex
decoding strategies. Our code and models are publicly available at
https://github.com/bloomberg/mixce-acl2023
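
The abstract defines both quantities: the forward cross-entropy H(P, Q) = E_{x~P}[-log Q(x)] (equivalent to MLE) and the reverse cross-entropy H(Q, P) = E_{x~Q}[-log P(x)]. Below is a minimal illustrative sketch of mixing the two in the synthetic setting where P is known. The mixing weight `eta` and the simple convex combination shown here are assumptions for illustration, not necessarily the estimator used in the paper; see the linked repository for the authors' implementation.

```python
import numpy as np

def forward_ce(p, q, eps=1e-12):
    # H(P, Q) = E_{x~P}[-log Q(x)] -- the standard MLE objective.
    return -np.sum(p * np.log(q + eps))

def reverse_ce(p, q, eps=1e-12):
    # H(Q, P) = E_{x~Q}[-log P(x)] -- penalizes mass Q places where P is small.
    return -np.sum(q * np.log(p + eps))

def mixce(p, q, eta=0.5):
    # Assumed mixing form for illustration: a convex combination of the two
    # cross-entropies, weighted by `eta` (a hypothetical hyperparameter here).
    return eta * forward_ce(p, q) + (1.0 - eta) * reverse_ce(p, q)

# Toy example: P concentrates on two symbols, while Q "over-generalizes" by
# spreading probability mass onto symbols P never produces. The reverse term
# makes that mismatch visible, whereas the forward term alone does not.
p = np.array([0.7, 0.3, 0.0, 0.0])
q = np.array([0.4, 0.3, 0.2, 0.1])
print(mixce(p, q, eta=0.5))
```

In practice P is unknown for real text, so the reverse term cannot be computed this directly; the sketch only illustrates why the mixed objective discourages placing probability mass outside the data distribution.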