TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

October 30, 2024
Authors: Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan Eric Lenssen, Liwei Wang, Federico Tombari, Bernt Schiele
cs.AI

Abstract

Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.
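To make the idea concrete, here is a minimal PyTorch sketch of a token-parameter attention layer in the spirit of the abstract: input tokens act as queries against learnable key and value parameter tokens, and the layer is scaled by appending new key-value pairs. The class name `TokenParameterAttention`, the `grow` method, the initialization, and the use of a plain softmax over parameter tokens are illustrative assumptions rather than the authors' implementation; the actual code is available at https://github.com/Haiyang-W/TokenFormer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenParameterAttention(nn.Module):
    """Stand-in for a linear projection: input tokens are queries,
    learnable parameter tokens serve as keys and values."""

    def __init__(self, dim_in: int, dim_out: int, num_param_tokens: int):
        super().__init__()
        # "Tokenized" model parameters: one key/value pair per parameter token.
        self.key_params = nn.Parameter(0.02 * torch.randn(num_param_tokens, dim_in))
        self.value_params = nn.Parameter(0.02 * torch.randn(num_param_tokens, dim_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_in); attention is computed over parameter tokens.
        scores = x @ self.key_params.t() / (x.shape[-1] ** 0.5)  # (batch, seq, P)
        weights = F.softmax(scores, dim=-1)
        return weights @ self.value_params                        # (batch, seq, dim_out)

    @torch.no_grad()
    def grow(self, extra_tokens: int) -> None:
        # Progressive scaling: append new key-value parameter pairs instead of
        # retraining from scratch. Zero initialization here is only illustrative.
        new_k = torch.zeros(extra_tokens, self.key_params.shape[1],
                            device=self.key_params.device, dtype=self.key_params.dtype)
        new_v = torch.zeros(extra_tokens, self.value_params.shape[1],
                            device=self.value_params.device, dtype=self.value_params.dtype)
        self.key_params = nn.Parameter(torch.cat([self.key_params, new_k]))
        self.value_params = nn.Parameter(torch.cat([self.value_params, new_v]))
```

In a full model, each linear projection inside the Transformer blocks would be replaced by such a layer, so growing the model amounts to calling `grow(...)` on every layer and resuming training rather than restarting from scratch.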
