T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
June 27, 2024
Authors: Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach
cs.AI
Abstract
Tokenizers are crucial for encoding information in Large Language Models, but
their development has recently stagnated, and they contain inherent weaknesses.
Major limitations include computational overhead, ineffective vocabulary use,
and unnecessarily large embedding and head layers. Additionally, their
performance is biased towards a reference corpus, leading to reduced
effectiveness for underrepresented languages.
To remedy these issues, we propose T-FREE, which directly embeds words
through sparse activation patterns over character triplets, and does not
require a reference corpus. T-FREE inherently exploits morphological
similarities and allows for strong compression of embedding layers. In our
exhaustive experimental evaluation, we achieve competitive downstream
performance with a parameter reduction of more than 85% on these layers.
Further, T-FREE shows significant improvements in cross-lingual transfer
learning.
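To make the core idea concrete, below is a minimal sketch of embedding a word via sparse activations over its character triplets, as described in the abstract. The specific hash function (SHA-256), the number of activations per trigram, the lowercasing step, and the 8192-slot table size are illustrative assumptions, not the paper's actual design choices or hyperparameters.

```python
import hashlib
import numpy as np

def char_trigrams(word: str) -> list[str]:
    """Split a word into overlapping character triplets, with boundary markers."""
    padded = f" {word} "  # boundary spaces let prefixes/suffixes hash distinctly
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def sparse_word_embedding(word: str,
                          table: np.ndarray,
                          activations_per_trigram: int = 4) -> np.ndarray:
    """Embed a word as the sum of embedding rows selected by hashing its trigrams.

    `table` is a (slots, dim) embedding matrix whose row count is a free
    hyperparameter, decoupled from any corpus-derived vocabulary size.
    """
    emb = np.zeros(table.shape[1])
    for tri in char_trigrams(word.lower()):
        for k in range(activations_per_trigram):
            # Deterministic hash -> sparse activation index (illustrative scheme).
            h = hashlib.sha256(f"{tri}|{k}".encode()).digest()
            idx = int.from_bytes(h[:8], "little") % table.shape[0]
            emb += table[idx]
    return emb

# Morphologically similar words share trigrams, hence overlapping activations:
rng = np.random.default_rng(0)
table = rng.normal(size=(8192, 16))  # far fewer slots than a typical subword vocab
a = sparse_word_embedding("swim", table)
b = sparse_word_embedding("swimming", table)
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # well above random pairs
```

Because the table size is a fixed hyperparameter rather than a function of a reference corpus, the embedding and head layers can be made much smaller, which is how a compression of more than 85% on these layers becomes possible.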