MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies
May 26, 2023
Authors: Shiyue Zhang, Shijie Wu, Ozan Irsoy, Steven Lu, Mohit Bansal, Mark Dredze, David Rosenberg
cs.AI
Abstract
Autoregressive language models are trained by minimizing the cross-entropy of
the model distribution Q relative to the data distribution P -- that is,
minimizing the forward cross-entropy, which is equivalent to maximum likelihood
estimation (MLE). We have observed that models trained in this way may
"over-generalize", in the sense that they produce non-human-like text.
Moreover, we believe that reverse cross-entropy, i.e., the cross-entropy of P
relative to Q, is a better reflection of how a human would evaluate text
generated by a model. Hence, we propose learning with MixCE, an objective that
mixes the forward and reverse cross-entropies. We evaluate models trained with
this objective on synthetic data settings (where P is known) and real data, and
show that the resulting models yield better generated text without complex
decoding strategies. Our code and models are publicly available at
https://github.com/bloomberg/mixce-acl2023
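
The abstract defines both quantities: the forward cross-entropy H(P, Q) = E_{x~P}[-log Q(x)] (equivalent to MLE) and the reverse cross-entropy H(Q, P) = E_{x~Q}[-log P(x)]. Below is a minimal illustrative sketch of mixing the two in the synthetic setting where P is known. The mixing weight `eta` and the simple convex combination shown here are assumptions for illustration, not necessarily the estimator used in the paper; see the linked repository for the authors' implementation.

```python
import numpy as np

def forward_ce(p, q, eps=1e-12):
    # H(P, Q) = E_{x~P}[-log Q(x)] -- the standard MLE objective.
    return -np.sum(p * np.log(q + eps))

def reverse_ce(p, q, eps=1e-12):
    # H(Q, P) = E_{x~Q}[-log P(x)] -- penalizes mass Q places where P is small.
    return -np.sum(q * np.log(p + eps))

def mixce(p, q, eta=0.5):
    # Assumed mixing form for illustration: a convex combination of the two
    # cross-entropies, weighted by `eta` (a hypothetical hyperparameter here).
    return eta * forward_ce(p, q) + (1.0 - eta) * reverse_ce(p, q)

# Toy example: P concentrates on two symbols, while Q "over-generalizes" by
# spreading probability mass onto symbols P never produces. The reverse term
# makes that mismatch visible, whereas the forward term alone does not.
p = np.array([0.7, 0.3, 0.0, 0.0])
q = np.array([0.4, 0.3, 0.2, 0.1])
print(mixce(p, q, eta=0.5))
```

In practice P is unknown for real text, so the reverse term cannot be computed this directly; the sketch only illustrates why the mixed objective discourages placing probability mass outside the data distribution.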