MultiHashFormer: Hash-Based Generative Language Models

Abstract

Language models (LMs) represent tokens using embedding matrices whose size scales linearly with the vocabulary size. To limit this parameter footprint, prior work on encoder-only models hashes many tokens into a single vector. While parameter-efficient, this approach produces many-to-one collisions that prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that enables hash-based autoregression. Each token is represented by a unique hash signature: a short sequence of discrete hash IDs generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector, which is processed by a Transformer decoder. A Hash Decoder then generates the hash signature of the next token, which is mapped back to text. We evaluate our approach at the 100M, 1B, and 3B parameter scales and show that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint and without any modifications.
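To make the idea of a hash signature concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a simple universal hash family of the form h(x) = ((a*x + b) mod p) mod B and retries random seeds until every token ID receives a distinct signature. The function names, the hash family, the retry strategy, and the sizes (1000 tokens, 4 hash functions, 256 buckets) are all illustrative assumptions.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1; assumed to exceed the largest token ID


def make_hash_functions(num_hashes, num_buckets, seed):
    """Draw independent functions h(x) = ((a*x + b) mod PRIME) mod num_buckets."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]
    # Bind a and b as default arguments so each lambda keeps its own pair.
    return [lambda x, a=a, b=b: ((a * x + b) % PRIME) % num_buckets for a, b in params]


def hash_signature(token_id, hash_fns):
    """Signature of a token: one discrete hash ID per hash function."""
    return tuple(h(token_id) for h in hash_fns)


def build_signature_table(vocab_size, hash_fns):
    """Return {signature: token_id}, or None if two tokens share a signature."""
    table = {}
    for token_id in range(vocab_size):
        sig = hash_signature(token_id, hash_fns)
        if sig in table:
            return None
        table[sig] = token_id
    return table


def find_injective_hashes(vocab_size, num_hashes, num_buckets, max_tries=100):
    """Hypothetical helper: retry seeds until all signatures are unique."""
    if num_buckets ** num_hashes < vocab_size:
        raise ValueError("signature space is smaller than the vocabulary")
    for seed in range(max_tries):
        hash_fns = make_hash_functions(num_hashes, num_buckets, seed)
        table = build_signature_table(vocab_size, hash_fns)
        if table is not None:
            return hash_fns, table
    raise RuntimeError("no collision-free family found; increase num_hashes or num_buckets")


# Usage: map a token ID to its signature and back again.
hash_fns, table = find_injective_hashes(vocab_size=1000, num_hashes=4, num_buckets=256)
signature = hash_signature(123, hash_fns)
assert table[signature] == 123
```

Under these assumed sizes, a single hash function with 256 buckets must collide on a 1000-token vocabulary, whereas four of them span 256^4 (about 4.3 billion) possible signatures while indexing only 4 x 256 = 1024 distinct hash IDs. This is the sense in which a signature can stay unique per token without the number of hash IDs growing with the vocabulary.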