Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages

Abstract

Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of the tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employ Normalized Sequence Length (NSL) as the key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Notable insights include the SUTRA tokenizer's superior handling of Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.
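As a rough illustration of the metric named above, the sketch below computes an NSL-style score under one common formulation: the total number of tokens a candidate tokenizer produces over a set of texts, divided by the total a baseline tokenizer produces over the same texts. This abstract does not give the paper's exact definition or its baseline, so the formula, the function name `normalized_sequence_length`, and the two toy tokenizers are all assumptions made for illustration; they are not the tokenizers evaluated in the study.

```python
from typing import Callable, Iterable, Sequence

# A tokenizer is modeled here as any function mapping text to a token sequence.
Tokenizer = Callable[[str], Sequence]


def normalized_sequence_length(
    texts: Iterable[str],
    tokenizer: Tokenizer,
    baseline: Tokenizer,
) -> float:
    """Ratio of total candidate tokens to total baseline tokens.

    Assumed formulation: values below 1.0 mean the candidate encodes the
    same texts in fewer tokens than the baseline.
    """
    texts = list(texts)
    candidate_total = sum(len(tokenizer(t)) for t in texts)
    baseline_total = sum(len(baseline(t)) for t in texts)
    if baseline_total == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    return candidate_total / baseline_total


def whitespace_tokenizer(text: str) -> Sequence:
    """Hypothetical stand-in: one token per whitespace-separated word."""
    return text.split()


def byte_tokenizer(text: str) -> Sequence:
    """Hypothetical stand-in: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    samples = ["नमस्ते दुनिया", "hello world"]
    score = normalized_sequence_length(samples, whitespace_tokenizer, byte_tokenizer)
    print(f"NSL (whitespace vs. byte baseline): {score:.4f}")
```

With real models, the two callables would wrap each model's own tokenizer, and the score would be computed separately for each of the 22 languages so that per-language efficiency can be compared.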
