Scaling Embedding Layers in Language Models
February 3, 2025
Authors: Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang
cs.AI
Abstract
We propose SCONE (Scalable, Contextualized,
Offloaded, N-gram Embedding), a method for
extending input embedding layers to enhance language model performance as layer
size scales. To avoid increased decoding costs, SCONE retains the original
vocabulary while introducing embeddings for a set of frequent n-grams. These
embeddings provide a contextualized representation for each input token and are
learned with a separate model during training. During inference, they are
precomputed and stored in off-accelerator memory with minimal impact on
inference speed. SCONE enables two new scaling strategies: increasing the
number of cached n-gram embeddings and scaling the model used to learn them,
all while maintaining fixed inference-time FLOPS. We show that scaling both
aspects allows SCONE to outperform a 1.9B parameter baseline across diverse
corpora, while using only half the inference-time FLOPS.
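To make the mechanism concrete, below is a minimal sketch, not the authors' implementation, of how such a lookup could work: frequent n-grams ending at each input position are matched against a precomputed embedding table held in off-accelerator (host) memory, and the retrieved vector is combined with the ordinary token embedding before the model runs. All names, sizes, the additive combination, and the longest-match rule are illustrative assumptions.

```python
import torch

# Toy, purely illustrative cache of "frequent n-gram" embeddings.
# In the setting described above, this table would be precomputed by a
# separate model and kept in off-accelerator (host) memory.
ngram_to_row = {(17, 42): 0, (42, 7, 99): 1}        # hypothetical cached n-grams
ngram_table = torch.randn(len(ngram_to_row), 512)   # stays in CPU / host RAM

# Ordinary vocabulary embedding of the language model (vocabulary unchanged).
token_embedding = torch.nn.Embedding(50_000, 512)

def longest_cached_ngram(tokens, pos, max_n=3):
    """Return the longest cached n-gram that ends at position `pos`, if any."""
    for n in range(max_n, 1, -1):
        if pos - n + 1 >= 0:
            key = tuple(tokens[pos - n + 1 : pos + 1])
            if key in ngram_to_row:
                return key
    return None

@torch.no_grad()  # inference-time sketch: the n-gram vectors are precomputed
def embed(tokens):
    """Standard token embeddings enriched with cached n-gram embeddings."""
    base = token_embedding(torch.tensor(tokens))
    for pos in range(len(tokens)):
        key = longest_cached_ngram(tokens, pos)
        if key is not None:
            # Fetch only the needed row from the (potentially huge) host-memory
            # table, so growing the table does not add inference-time FLOPS.
            base[pos] += ngram_table[ngram_to_row[key]]
    return base

print(embed([3, 17, 42, 7, 99]).shape)  # torch.Size([5, 512])
```

In this sketch, only the rows actually needed for the current input are touched, which is why enlarging the cached table (one of the two scaling axes described above) leaves per-token compute essentially unchanged.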