Chain-of-Model Learning for Language Model
May 17, 2025
Authors: Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu
cs.AI
Abstract
In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer in a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain of the output representation can only view all of its preceding chains in the input representation. Consequently, a model built upon the CoM framework can progressively scale up in size by adding chains on top of the previous model (i.e., chains), and can offer multiple sub-models of varying sizes for elastic inference by using different numbers of chains. Based on this principle, we devise the Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Building on CoLM, we further introduce CoLM-Air, which adds a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching and prefilling acceleration. Experimental results demonstrate that our CoLM family achieves performance comparable to the standard Transformer while simultaneously enabling greater flexibility, such as progressive scaling for improved training efficiency and multiple model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.
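The chain constraint described in the abstract, where each output chain may only read its preceding input chains at the hidden-dimension level, can be pictured as a block-lower-triangular linear map. Below is a minimal, hypothetical PyTorch sketch of that idea, not the authors' released implementation; the class name `ChainLinear`, the equal split of the hidden dimension, and the `active_chains` argument are illustrative assumptions.

```python
# Sketch only: a chain-structured linear layer in the spirit of CoR/CoM.
# The hidden dimension is split into n_chains equal chains, and output
# chain i is computed only from input chains 1..i (block-lower-triangular mask).
from typing import Optional

import torch
import torch.nn as nn


class ChainLinear(nn.Module):
    def __init__(self, dim: int, n_chains: int):
        super().__init__()
        assert dim % n_chains == 0, "hidden dim must split evenly into chains"
        self.dim, self.n_chains = dim, n_chains
        self.weight = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
        # Block-lower-triangular mask: output chain i may only read input chains <= i.
        chunk = dim // n_chains
        block = torch.tril(torch.ones(n_chains, n_chains))
        mask = block.repeat_interleave(chunk, 0).repeat_interleave(chunk, 1)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor, active_chains: Optional[int] = None) -> torch.Tensor:
        w = self.weight * self.mask
        if active_chains is not None:
            # Using only the first k chains yields a smaller sub-model whose
            # outputs match the corresponding slice of the full model.
            keep = active_chains * (self.dim // self.n_chains)
            return x[..., :keep] @ w[:keep, :keep].T
        return x @ w.T
```

Under this masking, a sub-model that uses only the first k chains is an exact slice of the full model, which is what would make elastic inference and progressive scaling (growing the model by appending new chains) possible in the sense the abstract describes.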
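The CoLM-Air variant, as described in the abstract, computes all keys and values within the first chain and shares them across chains. The sketch below is an assumption-laden illustration of that idea, not the official code; `SharedKVAttention`, the single-head setup, and the projection shapes are hypothetical. The point it shows is that the KV cache depends only on the first chain, so it stays identical no matter how many chains are active, which is what would allow switching between sub-models without recomputing the prefill.

```python
# Sketch only: attention where keys/values are projected from the first chain.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedKVAttention(nn.Module):
    def __init__(self, dim: int, n_chains: int, head_dim: int = 64):
        super().__init__()
        self.chunk = dim // n_chains
        self.q_proj = nn.Linear(dim, head_dim, bias=False)          # queries use all chains
        self.k_proj = nn.Linear(self.chunk, head_dim, bias=False)   # keys from chain 1 only
        self.v_proj = nn.Linear(self.chunk, head_dim, bias=False)   # values from chain 1 only
        self.o_proj = nn.Linear(head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); the first `chunk` features form the first chain.
        first_chain = x[..., : self.chunk]
        q = self.q_proj(x)
        k = self.k_proj(first_chain)   # shared across all chains
        v = self.v_proj(first_chain)   # shared across all chains
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(attn)
```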