
MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies

May 26, 2023
Authors: Shiyue Zhang, Shijie Wu, Ozan Irsoy, Steven Lu, Mohit Bansal, Mark Dredze, David Rosenberg
cs.AI

Abstract

Autoregressive language models are trained by minimizing the cross-entropy of the model distribution Q relative to the data distribution P -- that is, minimizing the forward cross-entropy, which is equivalent to maximum likelihood estimation (MLE). We have observed that models trained in this way may "over-generalize", in the sense that they produce non-human-like text. Moreover, we believe that reverse cross-entropy, i.e., the cross-entropy of P relative to Q, is a better reflection of how a human would evaluate text generated by a model. Hence, we propose learning with MixCE, an objective that mixes the forward and reverse cross-entropies. We evaluate models trained with this objective on synthetic data settings (where P is known) and real data, and show that the resulting models yield better generated text without complex decoding strategies. Our code and models are publicly available at https://github.com/bloomberg/mixce-acl2023
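For reference, the quantities described in the abstract can be written out explicitly. The following is a minimal sketch using the standard definitions of the two cross-entropies; the mixing weight \eta and the loss symbol are illustrative notation, not quoted from the paper:

    % Forward cross-entropy (minimizing it over the model parameters is MLE):
    H(P, Q) = -\mathbb{E}_{x \sim P}\,[\log Q(x)]

    % Reverse cross-entropy (the direction the abstract argues better reflects human evaluation):
    H(Q, P) = -\mathbb{E}_{x \sim Q}\,[\log P(x)]

    % A mixed objective of the kind the abstract describes, with mixing weight \eta \in [0, 1]:
    \mathcal{L}_{\mathrm{mix}}(\eta) = \eta \, H(P, Q) + (1 - \eta) \, H(Q, P)

Note that the reverse term requires \log P(x) for samples x drawn from the model, which is only available in the synthetic setting where P is known; how to train with this mixture on real data is part of the paper's contribution and is not reproduced in this sketch.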