Chain-of-Model Learning for Language Model

Abstract

In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates a causal relationship into the hidden states of each layer in a chain-like fashion, thereby bringing substantial scaling efficiency to model training and inference flexibility to deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) along the hidden dimension. In each layer, each chain of the output representation can only view its preceding chains in the input representation. Consequently, a model built on the CoM framework can progressively scale up by adding chains on top of previous models (i.e., chains), and can offer multiple sub-models of varying sizes for elastic inference by using different numbers of chains. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Building on CoLM, we further introduce CoLM-Air, which adds a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design offers additional extensibility, such as seamless LM switching and prefilling acceleration. Experimental results demonstrate that our CoLM family achieves performance comparable to the standard Transformer while enabling greater flexibility, such as progressive scaling to improve training efficiency and multiple model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.
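
To make the chain constraint concrete, the following is a minimal sketch of a single chain-structured linear layer. It is an illustration under stated assumptions, not the released CoLM code. It assumes that "preceding chains" includes a chain's own input slice, so that output chain i reads input chains 0 through i. The names `chain_mask`, `chain_linear`, and the chain sizes `[2, 2, 4]` are hypothetical. The sketch also checks the property that elastic inference relies on: truncating the layer to the first few chains reproduces the corresponding prefix of the full layer's output.

```python
import random


def chain_mask(chain_sizes):
    """Build a square 0/1 mask in which output chain i sees input chains 0..i.

    Assumption: a chain may read its own input slice as well as earlier ones.
    """
    # chain_of[k] is the index of the chain that hidden dimension k belongs to.
    chain_of = [i for i, size in enumerate(chain_sizes) for _ in range(size)]
    d = len(chain_of)
    return [[1 if chain_of[col] <= chain_of[row] else 0 for col in range(d)]
            for row in range(d)]


def chain_linear(x, weight, mask):
    """Masked matrix-vector product: y = (weight * mask) @ x."""
    return [sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
            for w_row, m_row in zip(weight, mask)]


chain_sizes = [2, 2, 4]  # hypothetical: three chains, hidden size 8
d = sum(chain_sizes)
rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
x = [rng.uniform(-1, 1) for _ in range(d)]
mask = chain_mask(chain_sizes)

# Full model: all three chains.
full = chain_linear(x, weight, mask)

# Sub-model: keep only the first two chains by truncating every tensor.
k = sum(chain_sizes[:2])
sub = chain_linear(x[:k],
                   [row[:k] for row in weight[:k]],
                   [row[:k] for row in mask[:k]])

# The sub-model output equals the first k entries of the full output,
# because the masked-out weights contribute nothing to those entries.
prefix_matches = all(abs(a - b) < 1e-12 for a, b in zip(full[:k], sub))
print(prefix_matches)
```

The mask is block lower-triangular over chains, which is why a smaller model can be cut out of a larger one without retraining in this toy setting, and why new chains can be appended without changing the outputs of existing ones.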

