Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
February 18, 2025
Authors: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
cs.AI
Abstract
A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs in place of token embeddings or a key-value cache. These approaches make it possible to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates, even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios of up to x1500 exist, which highlights a two-orders-of-magnitude gap between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on the sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.