

Nested Learning: The Illusion of Deep Learning Architectures

December 31, 2025
Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
cs.AI

Abstract

Despite recent progress, particularly in the development of Language Models, there remain fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find effective solutions. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a machine learning model as a set of nested, multi-level, and/or parallel optimization problems, each with its own context flow. Through the lens of NL, existing deep learning methods learn from data by compressing their own context flow, and in-context learning emerges naturally in large models. NL suggests a philosophy for designing more expressive learning algorithms with more levels, resulting in higher-order in-context learning and potentially unlocking effective continual learning capabilities. We advocate for NL through three core contributions: (1) Expressive Optimizers: we show that well-known gradient-based optimizers, such as Adam and SGD with Momentum, are in fact associative memory modules that aim to compress the gradients' information (by gradient descent). Building on this insight, we present more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Learning Module: leveraging NL's insights on learning algorithms, we present a sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: we present a new formulation of a memory system that generalizes the traditional view of long-/short-term memory. Combining our self-modifying sequence model with the continuum memory system, we present a continual learning module, called Hope, which shows promising results on language modeling, knowledge incorporation, few-shot generalization, continual learning, and long-context reasoning tasks.
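The claim that momentum-based optimizers act as associative memory modules compressing the gradient stream can be made concrete with a small sketch. The snippet below is illustrative only and is not the paper's implementation: it rewrites the classic momentum buffer update as a "memory write" over past gradients; the class name, hyperparameters, and toy quadratic objective are assumptions of this example.

```python
import numpy as np

# Minimal sketch (assumed, not from the paper): SGD with Momentum written so that
# the momentum buffer reads as a memory state compressing the stream of gradients.
class MomentumAsMemory:
    def __init__(self, dim, lr=0.1, decay=0.9):
        self.memory = np.zeros(dim)  # compressed summary of past gradients
        self.lr = lr
        self.decay = decay

    def write(self, grad):
        # "Memory write": decay old content, add the new gradient.
        # Algebraically identical to the momentum buffer m_t = beta * m_{t-1} + g_t.
        self.memory = self.decay * self.memory + grad
        return self.memory

    def step(self, params, grad):
        # Parameter update reads from the compressed gradient memory.
        return params - self.lr * self.write(grad)

# Toy usage on f(x) = ||x||^2 / 2, whose gradient is x itself.
opt = MomentumAsMemory(dim=2)
x = np.array([3.0, -2.0])
for _ in range(50):
    x = opt.step(x, grad=x)
print(x)  # approaches the minimum at the origin
```

Read this way, the optimizer is itself a small learning problem nested inside the outer one, which is the structural observation NL builds on.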