
L^2M: Mutual Information Scaling Law for Long-Context Language Modeling

March 6, 2025
作者: Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić
cs.AI

Abstract

We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L^2M) condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of large language models toward longer context lengths.
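
To make the distinction drawn in the abstract concrete, the following is a minimal sketch, in standard information-theoretic notation, of the two quantities being contrasted: the bipartite mutual information between two adjacent blocks of a token sequence and the conventional two-point mutual information between two individual tokens. The symbols $X_{1:L}$, $I_{\mathrm{bip}}$, and $I_{\mathrm{2pt}}$ are shorthand introduced here for illustration and are not necessarily the paper's own notation; the specific functional form of the scaling law is given in the paper, not reproduced here.

% Bipartite mutual information: dependence between the first and second
% halves of a length-2L token sequence X_1, ..., X_{2L}.
\[
I_{\mathrm{bip}}(L) \;=\; I\!\left(X_{1:L};\, X_{L+1:2L}\right)
\;=\; H\!\left(X_{1:L}\right) + H\!\left(X_{L+1:2L}\right) - H\!\left(X_{1:2L}\right)
\]
% Two-point mutual information: dependence between two single tokens
% separated by a distance d.
\[
I_{\mathrm{2pt}}(d) \;=\; I\!\left(X_t;\, X_{t+d}\right)
\;=\; H\!\left(X_t\right) + H\!\left(X_{t+d}\right) - H\!\left(X_t,\, X_{t+d}\right)
\]

The abstract's claim is that the growth of $I_{\mathrm{bip}}(L)$ with block length follows its own scaling law, distinct from and independent of how $I_{\mathrm{2pt}}(d)$ behaves with token distance, and that the L^2M condition ties a model's latent state size to this block-level scaling.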
