Farzi Data: Autoregressive Data Distillation
October 15, 2023
Authors: Noveen Sachdeva, Zexue He, Wang-Cheng Kang, Jianmo Ni, Derek Zhiyuan Cheng, Julian McAuley
cs.AI
Abstract
We study data distillation for auto-regressive machine learning tasks, where
the input and output have a strict left-to-right causal structure. More
specifically, we propose Farzi, which summarizes an event sequence dataset into
a small number of synthetic sequences -- Farzi Data -- which are optimized to
maintain (if not improve) model performance compared to training on the full
dataset. Under the hood, Farzi conducts memory-efficient data distillation by
(i) deriving efficient reverse-mode differentiation of the Adam optimizer by
leveraging Hessian-Vector Products; and (ii) factorizing the high-dimensional
discrete event-space into a latent-space which provably promotes implicit
regularization. Empirically, for sequential recommendation and language
modeling tasks, we are able to achieve 98-120% of downstream full-data
performance when training state-of-the-art models on Farzi Data of size as
little as 0.1% of the original dataset. Notably, being able to train better
models with significantly less data sheds light on the design of future large
auto-regressive models, and opens up new opportunities to further scale up
model and data sizes.
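
The sketch below is a minimal, illustrative example (not the authors' code) of the meta-gradient pattern the abstract alludes to: a meta-loss is differentiated through an inner optimizer update so that gradients flow back to the synthetic data, with the Hessian-vector product as the memory-efficient primitive. For brevity it unrolls a single SGD-style step rather than Adam, and all names (inner_loss, meta_loss, hvp, x_syn) are hypothetical placeholders.

```python
# Minimal sketch, assuming a bilevel data-distillation setup:
#   inner step:  theta' = theta - lr * grad inner_loss(theta, x_syn)
#   outer step:  d meta_loss(theta') / d x_syn
# Reverse-mode differentiation through the inner step requires a
# Hessian-vector product, which JAX forms without materializing the Hessian.
import jax
import jax.numpy as jnp

def inner_loss(theta, x_syn):
    # Toy quadratic "training loss" on the synthetic data x_syn.
    return jnp.sum((theta - x_syn) ** 2)

def meta_loss(x_syn, theta, lr=0.1):
    # One SGD-style inner update (Adam omitted for brevity),
    # then evaluate the updated parameters on a stand-in "real" objective.
    g = jax.grad(inner_loss)(theta, x_syn)
    theta_next = theta - lr * g
    return jnp.sum(theta_next ** 2)

def hvp(f, theta, v, x_syn):
    # Forward-over-reverse Hessian-vector product: the primitive that makes
    # unrolling reverse-mode differentiation of an optimizer step cheap.
    return jax.jvp(lambda t: jax.grad(f)(t, x_syn), (theta,), (v,))[1]

theta0 = jnp.ones(4)
x_syn0 = jnp.full(4, 0.5)

# Meta-gradient w.r.t. the synthetic data; JAX differentiates through the
# inner gradient via an implicit Hessian-vector product.
meta_grad = jax.grad(meta_loss)(x_syn0, theta0)

print(meta_grad)
print(hvp(inner_loss, theta0, jnp.ones(4), x_syn0))
```

In the paper's actual setting the inner optimizer is Adam and the synthetic sequences live in a factorized latent space, but the same differentiate-through-the-update structure applies.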