In-context learning and Occam's razor
October 17, 2024
Authors: Eric Elmoznino, Tom Marty, Tejas Kasetty, Leo Gagnon, Sarthak Mittal, Mahan Fathi, Dhanya Sridhar, Guillaume Lajoie
cs.AI
Abstract
The goal of machine learning is generalization. While the No Free Lunch
Theorem states that we cannot obtain theoretical guarantees for generalization
without further assumptions, in practice we observe that simple models which
explain the training data generalize best: a principle called Occam's razor.
Despite the need for simple models, most current approaches in machine learning
only minimize the training error, and at best indirectly promote simplicity
through regularization or architecture design. Here, we draw a connection
between Occam's razor and in-context learning: an emergent ability of certain
sequence models like Transformers to learn at inference time from past
observations in a sequence. In particular, we show that the next-token
prediction loss used to train in-context learners is directly equivalent to a
data compression technique called prequential coding, and that minimizing this
loss amounts to jointly minimizing both the training error and the complexity
of the model that was implicitly learned from context. Our theory and the
empirical experiments we use to support it not only provide a normative account
of in-context learning, but also elucidate the shortcomings of current
in-context learning methods, suggesting ways in which they can be improved. We
make our code available at https://github.com/3rdCore/PrequentialCode.
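As a concrete illustration of the equivalence the abstract describes, below is a minimal sketch of prequential coding on a binary sequence, assuming a toy Laplace-smoothed Bernoulli learner in place of the model an in-context learner would implicitly infer from context (the names `LaplaceBernoulli` and `prequential_code_length` are illustrative and not taken from the paper's repository). Each symbol is encoded under the model fit to all earlier symbols, so the total code length is exactly the cumulative next-token log loss in bits.

```python
import math

class LaplaceBernoulli:
    """Toy online learner: a Bernoulli model with Laplace smoothing.
    Stands in for the predictor implicitly learned from context."""
    def __init__(self):
        self.counts = [1, 1]  # pseudo-counts for symbols 0 and 1

    def prob(self, x):
        # p(x | everything seen so far)
        return self.counts[x] / sum(self.counts)

    def update(self, x):
        self.counts[x] += 1

def prequential_code_length(sequence, model):
    """Prequential coding: encode each symbol with the model fit to all
    earlier symbols, then update the model on that symbol. The total,
    sum_t -log2 p(x_t | x_<t), is the cumulative next-token loss in bits."""
    bits = 0.0
    for x in sequence:
        bits += -math.log2(model.prob(x))
        model.update(x)
    return bits

# A sequence with simple structure compresses below its raw length.
data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
print(f"{prequential_code_length(data, LaplaceBernoulli()):.2f} bits")  # ~8.95 < 10
```

Early symbols are expensive to encode under the near-uniform prior (the cost of pinning down the model), while later symbols become cheap once the statistics have been learned; in this sense the single number returned above bundles together model complexity and training error, which is the trade-off the abstract argues next-token prediction optimizes.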