Ultra-Sparse Memory Network

Abstract

It is widely acknowledged that the performance of Transformer models is exponentially related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges at inference time due to high memory access costs. This work introduces UltraMem, an architecture that incorporates large-scale, ultra-sparse memory layers to address these limitations. Our approach significantly reduces inference latency while maintaining model performance. We also investigate the scaling laws of this new architecture, demonstrating that it not only exhibits favorable scaling properties but also outperforms traditional models. In our experiments, we train networks with up to 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget.
