Order Matters in the Presence of Dataset Imbalance for Multilingual Learning
December 11, 2023
Authors: Dami Choi, Derrick Xin, Hamid Dadkhahi, Justin Gilmer, Ankush Garg, Orhan Firat, Chih-Kuan Yeh, Andrew M. Dai, Behrooz Ghorbani
cs.AI
Abstract
In this paper, we empirically study the optimization dynamics of multi-task
learning, particularly focusing on those that govern a collection of tasks with
significant data imbalance. We present a simple yet effective method of
pre-training on high-resource tasks, followed by fine-tuning on a mixture of
high/low-resource tasks. We provide a thorough empirical study and analysis of
this method's benefits showing that it achieves consistent improvements
relative to the performance trade-off profile of standard static weighting. We
analyze under what data regimes this method is applicable and show its
improvements empirically in neural machine translation (NMT) and multi-lingual
language modeling.
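
The two-stage recipe described in the abstract (pre-train on high-resource tasks, then fine-tune on a mixture of high- and low-resource tasks) can be pictured as a task-sampling schedule. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function name, the task names, the step counts, and the mixture weights are all hypothetical, and the abstract does not specify how the switch point or the fine-tuning weights are chosen.

```python
import random
from typing import Dict, Iterator, List


def two_phase_task_schedule(
    high_resource: List[str],
    low_resource: List[str],
    total_steps: int,
    pretrain_steps: int,
    finetune_weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[str]:
    """Yield the task to draw a training batch from at each step.

    Phase 1 (step < pretrain_steps): sample from high-resource tasks only.
    Phase 2 (remaining steps): sample from all tasks using fixed weights.
    All names and parameters here are illustrative assumptions.
    """
    if not 0 <= pretrain_steps <= total_steps:
        raise ValueError("pretrain_steps must lie in [0, total_steps]")
    rng = random.Random(seed)
    all_tasks = high_resource + low_resource
    weights = [finetune_weights[task] for task in all_tasks]
    for step in range(total_steps):
        if step < pretrain_steps:
            # Pre-training phase: low-resource tasks are never sampled.
            yield rng.choice(high_resource)
        else:
            # Fine-tuning phase: weighted mixture over every task.
            yield rng.choices(all_tasks, weights=weights, k=1)[0]


# Hypothetical example: one high-resource and one low-resource task.
schedule = list(
    two_phase_task_schedule(
        high_resource=["en-fr"],
        low_resource=["en-ro"],
        total_steps=10,
        pretrain_steps=6,
        finetune_weights={"en-fr": 0.5, "en-ro": 0.5},
    )
)
print(schedule)
```

Setting pretrain_steps to 0 in this sketch reduces it to a fixed mixture over all tasks for the whole run, which corresponds to the standard static weighting that the abstract uses as its point of comparison.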