Farzi Data: Autoregressive Data Distillation
October 15, 2023
Authors: Noveen Sachdeva, Zexue He, Wang-Cheng Kang, Jianmo Ni, Derek Zhiyuan Cheng, Julian McAuley
cs.AI
Abstract
We study data distillation for auto-regressive machine learning tasks, where
the input and output have a strict left-to-right causal structure. More
specifically, we propose Farzi, which summarizes an event sequence dataset into
a small number of synthetic sequences -- Farzi Data -- which are optimized to
maintain (if not improve) model performance compared to training on the full
dataset. Under the hood, Farzi conducts memory-efficient data distillation by
(i) deriving efficient reverse-mode differentiation of the Adam optimizer by
leveraging Hessian-Vector Products; and (ii) factorizing the high-dimensional
discrete event-space into a latent-space which provably promotes implicit
regularization. Empirically, for sequential recommendation and language
modeling tasks, we are able to achieve 98-120% of downstream full-data
performance when training state-of-the-art models on Farzi Data of size as
little as 0.1% of the original dataset. Notably, being able to train better
models with significantly less data sheds light on the design of future large
auto-regressive models, and opens up new opportunities to further scale up
model and data sizes.
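Point (i) of the abstract rests on Hessian-vector products (HVPs) to make reverse-mode differentiation through optimizer updates memory-efficient. As a rough illustration, the sketch below shows the standard double-backward HVP trick in PyTorch; the toy model, loss, and probe vector are hypothetical placeholders, not the paper's actual Farzi implementation or its Adam unrolling.

```python
# Minimal sketch (not the authors' code): a Hessian-vector product via
# reverse-mode autodiff, the building block the abstract says Farzi uses
# to differentiate through Adam steps without materializing the Hessian.
import torch
import torch.nn as nn

# Toy stand-ins for a model and a batch (purely illustrative).
model = nn.Linear(16, 16)
x = torch.randn(8, 16)
y = torch.randn(8, 16)

params = [p for p in model.parameters() if p.requires_grad]
loss = nn.functional.mse_loss(model(x), y)

# First backward pass, keeping the graph so we can differentiate again.
grads = torch.autograd.grad(loss, params, create_graph=True)

# An arbitrary vector v with the same shapes as the parameters.
v = [torch.randn_like(p) for p in params]

# HVP: differentiate <grad, v> w.r.t. the parameters. This costs roughly
# one extra backward pass instead of forming the full Hessian.
grad_dot_v = sum((g * vi).sum() for g, vi in zip(grads, v))
hvp = torch.autograd.grad(grad_dot_v, params)

print([h.shape for h in hvp])
```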