Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
February 18, 2025
Authors: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
cs.AI
Abstract
A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs in place of token embeddings or a key-value cache. These approaches make it possible to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates, even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios of up to x1500 exist, which highlights a two-orders-of-magnitude gap between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on the sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.