Farzi Data: Autoregressive Data Distillation
October 15, 2023
Authors: Noveen Sachdeva, Zexue He, Wang-Cheng Kang, Jianmo Ni, Derek Zhiyuan Cheng, Julian McAuley
cs.AI
Abstract
We study data distillation for auto-regressive machine learning tasks, where
the input and output have a strict left-to-right causal structure. More
specifically, we propose Farzi, which summarizes an event sequence dataset into
a small number of synthetic sequences -- Farzi Data -- which are optimized to
maintain (if not improve) model performance compared to training on the full
dataset. Under the hood, Farzi conducts memory-efficient data distillation by
(i) deriving efficient reverse-mode differentiation of the Adam optimizer by
leveraging Hessian-Vector Products; and (ii) factorizing the high-dimensional
discrete event-space into a latent-space which provably promotes implicit
regularization. Empirically, for sequential recommendation and language
modeling tasks, we are able to achieve 98-120% of downstream full-data
performance when training state-of-the-art models on Farzi Data of size as
little as 0.1% of the original dataset. Notably, being able to train better
models with significantly less data sheds light on the design of future large
auto-regressive models, and opens up new opportunities to further scale up
model and data sizes.
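
The sketch below is a minimal, illustrative example (not the authors' code) of the meta-gradient pattern the abstract alludes to: a meta-loss is differentiated through an inner optimizer update so that gradients flow back to the synthetic data, with the Hessian-vector product as the memory-efficient primitive. For brevity it unrolls a single SGD-style step rather than Adam, and all names (inner_loss, meta_loss, hvp, x_syn) are hypothetical placeholders.

```python
# Minimal sketch, assuming a bilevel data-distillation setup:
#   inner step:  theta' = theta - lr * grad inner_loss(theta, x_syn)
#   outer step:  d meta_loss(theta') / d x_syn
# Reverse-mode differentiation through the inner step requires a
# Hessian-vector product, which JAX forms without materializing the Hessian.
import jax
import jax.numpy as jnp

def inner_loss(theta, x_syn):
    # Toy quadratic "training loss" on the synthetic data x_syn.
    return jnp.sum((theta - x_syn) ** 2)

def meta_loss(x_syn, theta, lr=0.1):
    # One SGD-style inner update (Adam omitted for brevity),
    # then evaluate the updated parameters on a stand-in "real" objective.
    g = jax.grad(inner_loss)(theta, x_syn)
    theta_next = theta - lr * g
    return jnp.sum(theta_next ** 2)

def hvp(f, theta, v, x_syn):
    # Forward-over-reverse Hessian-vector product: the primitive that makes
    # unrolling reverse-mode differentiation of an optimizer step cheap.
    return jax.jvp(lambda t: jax.grad(f)(t, x_syn), (theta,), (v,))[1]

theta0 = jnp.ones(4)
x_syn0 = jnp.full(4, 0.5)

# Meta-gradient w.r.t. the synthetic data; JAX differentiates through the
# inner gradient via an implicit Hessian-vector product.
meta_grad = jax.grad(meta_loss)(x_syn0, theta0)

print(meta_grad)
print(hvp(inner_loss, theta0, jnp.ones(4), x_syn0))
```

In the paper's actual setting the inner optimizer is Adam and the synthetic sequences live in a factorized latent space, but the same differentiate-through-the-update structure applies.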