Memorization-Compression Cycles Improve Generalization

Abstract

We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during large language model (LLM) pretraining, evidenced by the gradient alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy, oscillating between positive and negative. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. In a pretraining task on arithmetic multiplication, GAPT improves out-of-distribution (OOD) generalization by 35%. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation, which parallels the functional role of sleep consolidation.
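The abstract refers to Matrix-Based Entropy (MBE) as a measure of representation entropy without defining it here. As a rough illustration only, the sketch below computes one common form from the literature, the order-2 Rényi matrix-based entropy S_2(K) = -log tr(K^2) of a trace-normalized Gram matrix K. The choice of order 2, the linear kernel on L2-normalized vectors, and the function name `matrix_based_entropy_order2` are assumptions made for this example and may differ from the definition used in the paper.

```python
import math


def matrix_based_entropy_order2(reps):
    """Order-2 matrix-based entropy (in nats) of a list of nonzero vectors.

    Assumed definition (not taken from the abstract):
    S_2(K) = -log tr(K^2), where K is the Gram matrix of the
    L2-normalized representations, rescaled to have unit trace.
    """
    n = len(reps)
    # L2-normalize each vector so every diagonal Gram entry equals 1.
    unit = []
    for v in reps:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    # Trace-normalized Gram matrix: K[i][j] = <u_i, u_j> / n.
    # For a symmetric K, tr(K^2) equals the sum of its squared entries,
    # so no eigendecomposition is needed at order 2.
    trace_k2 = 0.0
    for u in unit:
        for w in unit:
            k_ij = sum(a * b for a, b in zip(u, w)) / n
            trace_k2 += k_ij * k_ij
    return -math.log(trace_k2)


# Representations that all point in one direction (fully compressed).
collapsed = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
# Mutually orthogonal representations (maximally spread for n = 3).
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(matrix_based_entropy_order2(collapsed))
print(matrix_based_entropy_order2(spread))
```

Under this assumed definition, representations that collapse onto a single direction give an entropy of approximately zero, while n mutually orthogonal representations give log n, the maximum for n samples. Lower values therefore correspond to more compressed representations, which is the sense in which a reduction in MBE is read above.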
