Memorization-Compression Cycles Improve Generalization
May 13, 2025
Author: Fangyuan Yu
cs.AI
Abstract
We prove theoretically that generalization improves not only through data
scaling but also by compressing internal representations. To operationalize
this insight, we introduce the Information Bottleneck Language Modeling (IBLM)
objective, which reframes language modeling as a constrained optimization
problem: minimizing representation entropy subject to optimal prediction
performance. Empirically, we observe an emergent memorization-compression cycle
during LLM pretraining, evidenced by oscillations between positive and negative
gradient alignment of cross-entropy and Matrix-Based Entropy (MBE), a measure
of representation entropy. This pattern closely mirrors the
predictive-compressive trade-off prescribed by IBLM and also parallels the
biological alternation between awake learning and sleep consolidation.
Motivated by this observation, we propose Gated Phase Transition (GAPT), a
training algorithm that adaptively switches between memorization and
compression phases. When applied to GPT-2 pretraining on the FineWeb dataset,
GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT also improves
OOD generalization by 35% in a pretraining task on arithmetic multiplication.
In a setting designed to simulate catastrophic forgetting, GAPT reduces
interference by compressing and separating representations, achieving a 97%
improvement in separation, paralleling the functional role of sleep
consolidation.
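The abstract describes GAPT only at a high level, so the following is a hedged sketch of a training loop that alternates between a memorization phase (cross-entropy only) and a compression phase (cross-entropy plus an MBE penalty). The plateau-based gating rule, the penalty weight `lambda_mbe`, and the `out.logits` / `out.hidden` model interface are assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def gapt_train(model, loader, optimizer, lambda_mbe=0.1, patience=50):
    # Reuses matrix_based_entropy from the sketch above.
    phase, best, stall = "memorize", float("inf"), 0
    for inputs, targets in loader:
        out = model(inputs)                       # assumed to expose .logits (B, T, V) and .hidden (B, T, D)
        ce = F.cross_entropy(out.logits.flatten(0, 1), targets.flatten())
        mbe = matrix_based_entropy(out.hidden.flatten(0, 1))
        loss = ce if phase == "memorize" else ce + lambda_mbe * mbe

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Gate: track the metric the current phase targets and switch
        # phases once it stops improving for `patience` steps.
        metric = ce.item() if phase == "memorize" else mbe.item()
        if metric < best - 1e-4:
            best, stall = metric, 0
        else:
            stall += 1
        if stall >= patience:
            phase = "compress" if phase == "memorize" else "memorize"
            best, stall = float("inf"), 0
```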