
Nested Learning: The Illusion of Deep Learning Architectures

December 31, 2025
Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
cs.AI

Abstract

Despite recent progress, particularly in the development of Language Models, fundamental challenges and open questions remain about how such models can continually learn/memorize, self-improve, and find effective solutions. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a machine learning model as a set of nested, multi-level, and/or parallel optimization problems, each with its own context flow. Through the lens of NL, existing deep learning methods learn from data by compressing their own context flow, and in-context learning emerges naturally in large models. NL suggests a design philosophy: building more expressive learning algorithms by adding more levels, resulting in higher-order in-context learning and potentially unlocking effective continual learning capabilities. We advocate for NL through three core contributions: (1) Expressive Optimizers: We show that well-known gradient-based optimizers, such as Adam and SGD with Momentum, are in fact associative memory modules that compress gradient information (by gradient descent). Building on this insight, we present more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Learning Module: Leveraging NL's insights into learning algorithms, we present a sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: We present a new formulation of memory systems that generalizes the traditional long/short-term memory viewpoint. Combining our self-modifying sequence model with the continuum memory system, we present a continual learning module, called Hope, which shows promising results in language modeling, knowledge incorporation, few-shot generalization, continual learning, and long-context reasoning tasks.
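The first contribution reads standard optimizers as associative memory modules that compress gradient information by running gradient descent on an inner objective. As a minimal illustrative sketch of this viewpoint (not code from the paper; the quadratic inner loss, the variable names, and the step-size mapping beta = 1 - eta are assumptions chosen for the simplest case, EMA-style momentum), the snippet below checks numerically that one gradient-descent step on an inner loss matching the memory to the current gradient reproduces the familiar momentum update.

```python
import numpy as np

# Illustrative sketch: EMA-style momentum as an associative-memory update.
# Inner loss for the memory m, given the current (outer) gradient g_t:
#     l(m) = 0.5 * ||m - g_t||^2
# One gradient-descent step with step size eta, starting from m_{t-1}:
#     m_t = m_{t-1} - eta * (m_{t-1} - g_t) = (1 - eta) * m_{t-1} + eta * g_t
# which coincides with exponential-moving-average momentum for beta = 1 - eta.

rng = np.random.default_rng(0)
beta, eta = 0.9, 0.1          # assumed mapping: beta = 1 - eta
m_ema = np.zeros(4)           # momentum buffer, EMA view
m_gd = np.zeros(4)            # same buffer, inner gradient-descent view

for _ in range(5):
    g = rng.normal(size=4)                    # stand-in for an outer-problem gradient
    m_ema = beta * m_ema + (1.0 - beta) * g   # standard momentum (EMA) update
    m_gd = m_gd - eta * (m_gd - g)            # one GD step on 0.5 * ||m - g||^2
    assert np.allclose(m_ema, m_gd)           # the two views agree exactly
```

Per the abstract, the paper builds on this reading to obtain more expressive optimizers, e.g. by replacing the linear memory with a deep memory or the inner update with a more powerful learning rule.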