Scaling Embedding Layers in Language Models
February 3, 2025
Authors: Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang
cs.AI
Abstract
We propose SCONE (Scalable, Contextualized,
Offloaded, N-gram Embedding), a method for
extending input embedding layers to enhance language model performance as layer
size scales. To avoid increased decoding costs, SCONE retains the original
vocabulary while introducing embeddings for a set of frequent n-grams. These
embeddings provide a contextualized representation for each input token and are
learned with a separate model during training. During inference, they are
precomputed and stored in off-accelerator memory with minimal impact on
inference speed. SCONE enables two new scaling strategies: increasing the
number of cached n-gram embeddings and scaling the model used to learn them,
all while maintaining fixed inference-time FLOPS. We show that scaling both
aspects allows SCONE to outperform a 1.9B parameter baseline across diverse
corpora, while using only half the inference-time FLOPS.
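To make the mechanism concrete, below is a minimal sketch, not the authors' implementation, of how such a lookup could work: frequent n-grams ending at each input position are matched against a precomputed embedding table held in off-accelerator (host) memory, and the retrieved vector is combined with the ordinary token embedding before the model runs. All names, sizes, the additive combination, and the longest-match rule are illustrative assumptions.

```python
import torch

# Toy, purely illustrative cache of "frequent n-gram" embeddings.
# In the setting described above, this table would be precomputed by a
# separate model and kept in off-accelerator (host) memory.
ngram_to_row = {(17, 42): 0, (42, 7, 99): 1}        # hypothetical cached n-grams
ngram_table = torch.randn(len(ngram_to_row), 512)   # stays in CPU / host RAM

# Ordinary vocabulary embedding of the language model (vocabulary unchanged).
token_embedding = torch.nn.Embedding(50_000, 512)

def longest_cached_ngram(tokens, pos, max_n=3):
    """Return the longest cached n-gram that ends at position `pos`, if any."""
    for n in range(max_n, 1, -1):
        if pos - n + 1 >= 0:
            key = tuple(tokens[pos - n + 1 : pos + 1])
            if key in ngram_to_row:
                return key
    return None

@torch.no_grad()  # inference-time sketch: the n-gram vectors are precomputed
def embed(tokens):
    """Standard token embeddings enriched with cached n-gram embeddings."""
    base = token_embedding(torch.tensor(tokens))
    for pos in range(len(tokens)):
        key = longest_cached_ngram(tokens, pos)
        if key is not None:
            # Fetch only the needed row from the (potentially huge) host-memory
            # table, so growing the table does not add inference-time FLOPS.
            base[pos] += ngram_table[ngram_to_row[key]]
    return base

print(embed([3, 17, 42, 7, 99]).shape)  # torch.Size([5, 512])
```

In this sketch, only the rows actually needed for the current input are touched, which is why enlarging the cached table (one of the two scaling axes described above) leaves per-token compute essentially unchanged.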