Chain-of-Model Learning for Language Model
May 17, 2025
Authors: Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu
cs.AI
Abstract
In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer in a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain of the output representations can only view its own and all preceding chains in the input representations. Consequently, a model built upon the CoM framework can progressively scale up its size by adding chains on top of the previous models (i.e., chains), and can offer multiple sub-models of varying sizes for elastic inference by using different numbers of chains. Based on this principle, we devise the Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. On top of CoLM, we further introduce CoLM-Air, which adds a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design offers additional extensibility, such as enabling seamless LM switching, prefilling acceleration, and so on. Experimental results demonstrate that our CoLM family achieves performance comparable to the standard Transformer while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and multiple model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.
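To make the Chain-of-Representation idea above more concrete, here is a minimal PyTorch sketch of a chain-structured linear layer in which each output chain only reads from its own and preceding input chains, and a prefix of chains can be used as a smaller sub-model. The class and argument names (ChainLinear, chain_dims, num_chains) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a Chain-of-Representation (CoR) linear layer, assuming the
# hidden dimension is split into chains and each output chain may only read
# from its own and preceding input chains (block lower-triangular weights).
import torch
import torch.nn as nn


class ChainLinear(nn.Module):
    """Linear layer whose weight is block lower-triangular over chains,
    so output chain i depends only on input chains 0..i."""

    def __init__(self, chain_dims):
        super().__init__()
        self.chain_dims = list(chain_dims)
        dim = sum(self.chain_dims)
        self.weight = nn.Parameter(torch.randn(dim, dim) * dim ** -0.5)
        self.bias = nn.Parameter(torch.zeros(dim))

        # Block lower-triangular mask: rows of output chain i may only use
        # columns belonging to input chains 0..i.
        mask = torch.zeros(dim, dim)
        row = 0
        for i, out_d in enumerate(self.chain_dims):
            col_end = sum(self.chain_dims[: i + 1])
            mask[row: row + out_d, :col_end] = 1.0
            row += out_d
        self.register_buffer("mask", mask)

    def forward(self, x, num_chains=None):
        """If num_chains is given, only the first chains are used,
        yielding a smaller sub-model for elastic inference."""
        w = self.weight * self.mask
        if num_chains is None:
            return x @ w.t() + self.bias
        keep = sum(self.chain_dims[:num_chains])
        return x[..., :keep] @ w[:keep, :keep].t() + self.bias[:keep]


# Usage: a full forward pass vs. a smaller sub-model using only 2 chains.
layer = ChainLinear(chain_dims=[64, 64, 128])
x = torch.randn(2, 10, 256)
full = layer(x)                    # all 3 chains -> width 256
small = layer(x, num_chains=2)     # first 2 chains -> width 128
# Because later chains never feed earlier ones, the sub-model's output matches
# the first 128 features of the full model.
assert torch.allclose(full[..., :128], small, atol=1e-5)
```

The assertion illustrates why a prefix of chains behaves as a self-contained smaller model, which is what makes progressive scaling and elastic inference possible.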
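Similarly, the KV sharing mechanism of CoLM-Air described in the abstract can be sketched as follows: keys and values are projected from the first chain only, so every sub-model produces the same KV cache. This single-head sketch omits causal masking and the chain-wise structure of the query/output projections, and all names (SharedKVAttention, head_dim) are assumptions rather than the paper's code.

```python
# Minimal sketch of CoLM-Air-style KV sharing, assuming K/V are computed from
# the first chain only and shared across all chains.
import math
import torch
import torch.nn as nn


class SharedKVAttention(nn.Module):
    """Single-head attention where K/V come from the first chain only, so any
    sub-model (any prefix of chains) yields the same KV cache."""

    def __init__(self, chain_dims, head_dim=64):
        super().__init__()
        self.chain_dims = list(chain_dims)
        dim = sum(self.chain_dims)
        first_dim = self.chain_dims[0]
        self.head_dim = head_dim
        # Keys and values depend only on the first chain.
        self.k_proj = nn.Linear(first_dim, head_dim, bias=False)
        self.v_proj = nn.Linear(first_dim, head_dim, bias=False)
        # Queries and outputs use the (possibly truncated) full width.
        self.q_proj = nn.Linear(dim, head_dim, bias=False)
        self.out_proj = nn.Linear(head_dim, dim, bias=False)

    def forward(self, x, num_chains=None):
        first_dim = self.chain_dims[0]
        keep = sum(self.chain_dims[:num_chains]) if num_chains else x.shape[-1]
        # Shared K/V: identical for every sub-model, so the KV cache can be
        # reused when switching between models of different sizes.
        k = self.k_proj(x[..., :first_dim])
        v = self.v_proj(x[..., :first_dim])
        q = x[..., :keep] @ self.q_proj.weight[:, :keep].t()
        attn = torch.softmax(
            q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1
        )
        return (attn @ v) @ self.out_proj.weight[:keep, :].t()


# Usage: K/V are the same whether we run 1 chain or all chains, which is what
# enables seamless LM switching and a shared prefill across sub-models.
attn = SharedKVAttention(chain_dims=[64, 64, 128])
x = torch.randn(2, 10, 256)
y_full = attn(x)                   # width 256
y_small = attn(x, num_chains=2)    # width 128
```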