T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
June 27, 2024
Authors: Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach
cs.AI
Abstract
Tokenizers are crucial for encoding information in Large Language Models, but
their development has recently stagnated, and they contain inherent weaknesses.
Major limitations include computational overhead, ineffective vocabulary use,
and unnecessarily large embedding and head layers. Additionally, their
performance is biased towards a reference corpus, leading to reduced
effectiveness for underrepresented languages.
To remedy these issues, we propose T-FREE, which directly embeds words
through sparse activation patterns over character triplets, and does not
require a reference corpus. T-FREE inherently exploits morphological
similarities and allows for strong compression of embedding layers. In our
exhaustive experimental evaluation, we achieve competitive downstream
performance with a parameter reduction of more than 85% on these layers.
Further, T-FREE shows significant improvements in cross-lingual transfer
learning.
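To make the core idea concrete, below is a minimal sketch of embedding a word via sparse activations over its character triplets, as described in the abstract. The specific hash function (SHA-256), the number of activations per trigram, the lowercasing step, and the 8192-slot table size are illustrative assumptions, not the paper's actual design choices or hyperparameters.

```python
import hashlib
import numpy as np

def char_trigrams(word: str) -> list[str]:
    """Split a word into overlapping character triplets, with boundary markers."""
    padded = f" {word} "  # boundary spaces let prefixes/suffixes hash distinctly
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def sparse_word_embedding(word: str,
                          table: np.ndarray,
                          activations_per_trigram: int = 4) -> np.ndarray:
    """Embed a word as the sum of embedding rows selected by hashing its trigrams.

    `table` is a (slots, dim) embedding matrix whose row count is a free
    hyperparameter, decoupled from any corpus-derived vocabulary size.
    """
    emb = np.zeros(table.shape[1])
    for tri in char_trigrams(word.lower()):
        for k in range(activations_per_trigram):
            # Deterministic hash -> sparse activation index (illustrative scheme).
            h = hashlib.sha256(f"{tri}|{k}".encode()).digest()
            idx = int.from_bytes(h[:8], "little") % table.shape[0]
            emb += table[idx]
    return emb

# Morphologically similar words share trigrams, hence overlapping activations:
rng = np.random.default_rng(0)
table = rng.normal(size=(8192, 16))  # far fewer slots than a typical subword vocab
a = sparse_word_embedding("swim", table)
b = sparse_word_embedding("swimming", table)
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # well above random pairs
```

Because the table size is a fixed hyperparameter rather than a function of a reference corpus, the embedding and head layers can be made much smaller, which is how a compression of more than 85% on these layers becomes possible.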