T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
June 27, 2024
Authors: Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach
cs.AI
Abstract
Tokenizers are crucial for encoding information in Large Language Models, but
their development has recently stagnated, and they contain inherent weaknesses.
Major limitations include computational overhead, ineffective vocabulary use,
and unnecessarily large embedding and head layers. Additionally, their
performance is biased towards a reference corpus, leading to reduced
effectiveness for underrepresented languages.
To remedy these issues, we propose T-FREE, which directly embeds words
through sparse activation patterns over character triplets, and does not
require a reference corpus. T-FREE inherently exploits morphological
similarities and allows for strong compression of embedding layers. In our
exhaustive experimental evaluation, we achieve competitive downstream
performance with a parameter reduction of more than 85% on these layers.
Further, T-FREE shows significant improvements in cross-lingual transfer
learning.
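To make the core mechanism concrete, below is a minimal sketch of mapping a word to a sparse activation pattern over character triplets via hashing. The vocabulary size, number of hash activations per trigram, whitespace padding, and the seeded SHA-256 hashing scheme are illustrative assumptions, not the paper's exact configuration.

```python
import hashlib

# Illustrative parameters -- the paper's actual embedding-matrix size and
# number of activations per trigram may differ.
VOCAB_SIZE = 8192   # rows of the (compressed) embedding matrix
NUM_HASHES = 4      # active rows contributed by each character trigram

def word_trigrams(word: str) -> list[str]:
    """Split a word into overlapping character triplets, padded with spaces
    so that word boundaries are also represented."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def sparse_activations(word: str) -> set[int]:
    """Map a word to a sparse set of active rows in the embedding matrix.
    Each trigram activates NUM_HASHES rows via independently seeded hashes,
    so no reference-corpus vocabulary lookup is needed."""
    active = set()
    for tri in word_trigrams(word):
        for seed in range(NUM_HASHES):
            digest = hashlib.sha256(f"{seed}:{tri}".encode()).digest()
            active.add(int.from_bytes(digest[:8], "big") % VOCAB_SIZE)
    return active

# Morphologically similar words share trigrams and hence activations:
overlap = sparse_activations("token") & sparse_activations("tokens")
print(len(overlap), "shared active rows between 'token' and 'tokens'")
```

A word's dense embedding would then be obtained by aggregating (for instance, summing) the embedding-matrix rows indexed by its active set. Because related surface forms such as "token" and "tokens" share most of their trigrams, their activation patterns overlap heavily, which is the morphological-similarity property the abstract refers to.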