Order Matters in the Presence of Dataset Imbalance for Multilingual Learning
December 11, 2023
Authors: Dami Choi, Derrick Xin, Hamid Dadkhahi, Justin Gilmer, Ankush Garg, Orhan Firat, Chih-Kuan Yeh, Andrew M. Dai, Behrooz Ghorbani
cs.AI
Abstract
In this paper, we empirically study the optimization dynamics of multi-task
learning, particularly focusing on those that govern a collection of tasks with
significant data imbalance. We present a simple yet effective method of
pre-training on high-resource tasks, followed by fine-tuning on a mixture of
high/low-resource tasks. We provide a thorough empirical study and analysis of
this method's benefits showing that it achieves consistent improvements
relative to the performance trade-off profile of standard static weighting. We
analyze under what data regimes this method is applicable and show its
improvements empirically in neural machine translation (NMT) and multi-lingual
language modeling.
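The two-stage recipe the abstract describes (pre-train on high-resource tasks, then fine-tune on a mixture of high- and low-resource tasks) can be sketched as a sampling schedule. This is an illustrative sketch, not the paper's implementation: the task names, example counts, and the `pretrain_steps` parameter are hypothetical, and the mixture is computed with simple proportional (temperature-scaled) weighting.

```python
# Sketch: static weighting vs. a two-stage pre-train/fine-tune schedule
# for sampling tasks under data imbalance. All task names and sizes below
# are hypothetical illustrations, not values from the paper.

TASK_SIZES = {"en-fr": 1_000_000, "en-gd": 10_000, "en-kk": 5_000}

def static_weights(sizes, temperature=1.0):
    """Standard static weighting: sample each task in proportion to
    size ** (1 / temperature), fixed for the entire run."""
    scaled = {t: n ** (1.0 / temperature) for t, n in sizes.items()}
    total = sum(scaled.values())
    return {t: w / total for t, w in scaled.items()}

def two_stage_weights(step, pretrain_steps, sizes, high_resource,
                      temperature=1.0):
    """Two-stage schedule: sample only from high-resource tasks for the
    first `pretrain_steps` steps, then switch to a static mixture over
    all tasks for fine-tuning."""
    if step < pretrain_steps:
        subset = {t: sizes[t] for t in high_resource}
        return static_weights(subset, temperature)
    return static_weights(sizes, temperature)

# During pre-training, only the high-resource task is sampled.
print(two_stage_weights(0, 100, TASK_SIZES, ["en-fr"]))
# After the switch, low-resource tasks enter the mixture.
print(two_stage_weights(200, 100, TASK_SIZES, ["en-fr"]))
```

The key contrast with standard static weighting is that the mixture changes once mid-run: low-resource tasks contribute no gradient until the fine-tuning phase begins, which is the ordering effect the paper studies.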