Memorization-Compression Cycles Improve Generalization

May 13, 2025
Author: Fangyuan Yu
cs.AI

Abstract

We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by an oscillation between positive and negative gradient alignment of cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT also improves OOD generalization by 35% in a pretraining task on arithmetic multiplication. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation, paralleling the functional role of sleep consolidation.
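
The abstract describes IBLM only in words. The display below is one plausible formalization consistent with that wording, where L_CE is the cross-entropy loss, H(R_theta) is the entropy of the internal representations (measured via MBE in the paper), and the slack epsilon and trade-off weight beta are illustrative parameters, not values from the paper:

```latex
% Constrained form: compress representations without sacrificing prediction.
\min_{\theta}\; H(R_{\theta})
\quad \text{s.t.} \quad
\mathcal{L}_{\mathrm{CE}}(\theta) \;\le\; \mathcal{L}_{\mathrm{CE}}^{*} + \varepsilon

% Relaxed (Lagrangian) form that a training loop can optimize directly:
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{CE}}(\theta) \;+\; \beta\, H(R_{\theta})
```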

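As a rough sketch of the moving parts the abstract names, the code below estimates MBE from a batch of hidden states, applies a GAPT-style gated update, and measures the gradient alignment whose sign oscillation the paper reports. The Gram-matrix eigenvalue formulation of MBE, the model interface returning (logits, hidden), the beta weight, and the plateau-based phase gate are all assumptions for illustration; the function names (mbe, gapt_step, next_phase, grad_alignment) are hypothetical, not the paper's API.

```python
import torch
import torch.nn.functional as F

def mbe(h: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Matrix-based entropy of hidden states h with shape (n, d).

    Uses eigenvalues of the trace-normalized Gram matrix; this is one
    common formulation of representation entropy, assumed here.
    """
    k = h @ h.T                                  # (n, n) Gram matrix
    k = k / k.diagonal().sum().clamp_min(eps)    # normalize: eigenvalues sum to 1
    eig = torch.linalg.eigvalsh(k).clamp_min(eps)
    return -(eig * eig.log()).sum()              # von Neumann-style entropy

def gapt_step(model, tokens, targets, optimizer, phase: str, beta: float = 0.1):
    """One GAPT-style update: 'memorize' optimizes cross-entropy alone,
    'compress' adds beta * MBE(hidden). beta is an illustrative value."""
    logits, hidden = model(tokens)               # assumed model interface
    loss = F.cross_entropy(logits, targets)
    if phase == "compress":
        loss = loss + beta * mbe(hidden)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def next_phase(phase: str, ce_history: list, patience: int = 3, tol: float = 1e-3) -> str:
    """Hypothetical gate: enter compression once cross-entropy stops
    improving; return to memorization once it starts degrading."""
    if len(ce_history) <= patience:
        return phase
    recent_gain = ce_history[-patience - 1] - ce_history[-1]
    if phase == "memorize" and recent_gain < tol:
        return "compress"
    if phase == "compress" and recent_gain < -tol:
        return "memorize"
    return phase

def grad_alignment(model, ce_loss: torch.Tensor, mbe_loss: torch.Tensor) -> float:
    """Cosine similarity between CE and MBE gradients; the reported
    memorization-compression cycle shows this oscillating in sign."""
    g_ce = torch.autograd.grad(ce_loss, model.parameters(), retain_graph=True)
    g_mbe = torch.autograd.grad(mbe_loss, model.parameters(), retain_graph=True)
    flat = lambda gs: torch.cat([g.reshape(-1) for g in gs])
    return float(F.cosine_similarity(flat(g_ce), flat(g_mbe), dim=0))
```

In this sketch, a training loop would call next_phase between steps to decide when to toggle the MBE penalty; the paper's GAPT switches adaptively, which the simple plateau rule above only approximates.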