T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Abstract

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they have inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, which reduces their effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
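To make the idea of "sparse activation patterns over character triplets" concrete, the following is a minimal sketch, not the authors' implementation. The abstract gives no details, so every specific here is an assumption chosen for illustration: padding each word with one space on both sides, mapping each triplet to rows with salted SHA-256 hashes, the hypothetical parameters `table_rows` and `hashes_per_trigram`, and summing the active rows to form the word embedding.

```python
import hashlib


def char_trigrams(word: str) -> list[str]:
    """Return overlapping character triplets of a word.

    Assumption: the word is padded with one space on each side so that
    word boundaries contribute their own triplets.
    """
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def trigram_rows(trigram: str, table_rows: int, hashes_per_trigram: int) -> set[int]:
    """Map one triplet to a few embedding-table rows via salted hashes.

    Assumption: salted SHA-256 stands in for whatever hash family is used.
    """
    rows = set()
    for salt in range(hashes_per_trigram):
        digest = hashlib.sha256(f"{salt}:{trigram}".encode("utf-8")).digest()
        rows.add(int.from_bytes(digest[:8], "big") % table_rows)
    return rows


def word_activation(word: str, table_rows: int = 8000,
                    hashes_per_trigram: int = 4) -> set[int]:
    """Sparse activation pattern of a word: the union of its triplets' rows."""
    active: set[int] = set()
    for trigram in char_trigrams(word):
        active |= trigram_rows(trigram, table_rows, hashes_per_trigram)
    return active


def embed(word: str, table: list[list[float]],
          hashes_per_trigram: int = 4) -> list[float]:
    """Embed a word by summing the table rows selected by its activation pattern."""
    rows = word_activation(word, len(table), hashes_per_trigram)
    dim = len(table[0])
    return [sum(table[r][d] for r in rows) for d in range(dim)]


if __name__ == "__main__":
    a = word_activation("house")
    b = word_activation("houses")
    # Morphologically related words share triplets, hence share active rows.
    print(len(a), len(b), len(a & b))
```

In this sketch no reference corpus is consulted: the pattern depends only on the characters of the word. Words that share triplets, such as "house" and "houses", share active rows, which is one way to read the claim that morphological similarity is exploited. The table size is also a free parameter rather than a learned vocabulary size, which is where embedding-layer compression would come from under these assumptions.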
