Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Abstract

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher language model to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, which introduces considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD), which enables CTD by operating at an interface common to all tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer. They also show that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
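
To make the byte-level interface concrete, the following is a minimal sketch in Python and not the paper's implementation. It assumes the teacher's context ends on a token boundary, so the next-byte distribution can be obtained by summing the probability of all tokens that share a first byte. Bytes that fall inside a token would require conditioning on partially consumed tokens, which is not shown here. The toy vocabulary, the probabilities, the function names `first_byte_distribution` and `kl_divergence`, and the student byte-head output are all hypothetical.

```python
import math
from collections import defaultdict


def first_byte_distribution(token_probs, vocab):
    """Marginalize a next-token distribution onto the first byte of each token.

    token_probs: dict mapping token id -> probability
    vocab:       dict mapping token id -> token as a bytes object
    Assumption: the context ends on a token boundary.
    """
    byte_probs = defaultdict(float)
    for token_id, prob in token_probs.items():
        token_bytes = vocab[token_id]
        if token_bytes:  # this toy version skips empty (special) tokens
            byte_probs[token_bytes[0]] += prob
    total = sum(byte_probs.values())
    return {byte: prob / total for byte, prob in byte_probs.items()}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over byte values; q is floored at eps to avoid log(0)."""
    return sum(
        pv * math.log(pv / max(q.get(byte, 0.0), eps))
        for byte, pv in p.items()
        if pv > 0
    )


# Hypothetical teacher vocabulary and next-token distribution.
teacher_vocab = {0: b"the", 1: b"then", 2: b"a", 3: b"an"}
teacher_token_probs = {0: 0.5, 1: 0.1, 2: 0.3, 3: 0.1}

# Teacher distribution expressed at the byte level.
teacher_bytes = first_byte_distribution(teacher_token_probs, teacher_vocab)

# Hypothetical output of a byte-level decoder head attached to the student.
student_bytes = {ord("t"): 0.5, ord("a"): 0.5}

# Distillation loss computed entirely in byte space.
loss = kl_divergence(teacher_bytes, student_bytes)
```

Because both distributions are defined over the same 256 byte values, the loss never needs a mapping between the teacher's and the student's vocabularies, which is the property the abstract attributes to the byte-level interface.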
