SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection
January 24, 2024
Authors: Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar
cs.AI
Abstract
Pre-training large language models is known to be extremely resource
intensive and often inefficient, under-utilizing the information
encapsulated in the training text sequences. In this paper, we present SpacTor,
a new training procedure consisting of (1) a hybrid objective combining span
corruption (SC) and replaced token detection (RTD), and (2) a two-stage
curriculum that optimizes the hybrid objective over the initial τ
iterations, then transitions to standard SC loss. We show empirically that the
effectiveness of the hybrid objective is tied to the two-stage pre-training
schedule, and provide extensive analysis on why this is the case. In our
experiments with encoder-decoder architectures (T5) on a variety of NLP tasks,
SpacTor-T5 yields the same downstream performance as standard SC pre-training,
while enabling a 50% reduction in pre-training iterations and 40% reduction in
total FLOPs. Alternatively, given the same amount of computing budget, we find
that SpacTor results in significantly improved downstream benchmark
performance.
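
For concreteness, the following is a minimal sketch of the two-stage curriculum described in the abstract: a hybrid SC + RTD objective for the first τ steps, followed by the standard span-corruption loss alone. The helper callables `sc_loss` and `rtd_loss` and the weighting `lambda_rtd` are illustrative assumptions, not the authors' implementation.

```python
from typing import Any, Callable

def spactor_loss(
    batch: Any,
    step: int,
    tau: int,
    sc_loss: Callable[[Any], float],   # span-corruption loss for this batch (assumed helper)
    rtd_loss: Callable[[Any], float],  # replaced-token-detection loss (assumed helper)
    lambda_rtd: float = 1.0,           # RTD weight; the value here is an assumption
) -> float:
    """Two-stage curriculum: hybrid SC + RTD loss for the first tau steps, then SC only."""
    loss = sc_loss(batch)
    if step < tau:
        # Stage 1: add the ELECTRA-style replaced-token-detection term to the SC loss.
        loss += lambda_rtd * rtd_loss(batch)
    # Stage 2 (step >= tau): standard span-corruption loss only.
    return loss
```

The point of the sketch is the schedule itself: the RTD term is only active during the initial τ iterations, after which training reverts to the plain SC objective, which the authors report is what makes the hybrid objective effective.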