Farzi Data: Autoregressive Data Distillation
October 15, 2023
Authors: Noveen Sachdeva, Zexue He, Wang-Cheng Kang, Jianmo Ni, Derek Zhiyuan Cheng, Julian McAuley
cs.AI
Abstract
We study data distillation for auto-regressive machine learning tasks, where
the input and output have a strict left-to-right causal structure. More
specifically, we propose Farzi, which summarizes an event sequence dataset into
a small number of synthetic sequences -- Farzi Data -- which are optimized to
maintain (if not improve) model performance compared to training on the full
dataset. Under the hood, Farzi conducts memory-efficient data distillation by
(i) deriving efficient reverse-mode differentiation of the Adam optimizer by
leveraging Hessian-Vector Products; and (ii) factorizing the high-dimensional
discrete event-space into a latent-space which provably promotes implicit
regularization. Empirically, for sequential recommendation and language
modeling tasks, we are able to achieve 98-120% of downstream full-data
performance when training state-of-the-art models on Farzi Data of size as
little as 0.1% of the original dataset. Notably, being able to train better
models with significantly less data sheds light on the design of future large
auto-regressive models, and opens up new opportunities to further scale up
model and data sizes.
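Point (i) of the abstract rests on Hessian-vector products (HVPs) to make reverse-mode differentiation through optimizer updates memory-efficient. As a rough illustration, the sketch below shows the standard double-backward HVP trick in PyTorch; the toy model, loss, and probe vector are hypothetical placeholders, not the paper's actual Farzi implementation or its Adam unrolling.

```python
# Minimal sketch (not the authors' code): a Hessian-vector product via
# reverse-mode autodiff, the building block the abstract says Farzi uses
# to differentiate through Adam steps without materializing the Hessian.
import torch
import torch.nn as nn

# Toy stand-ins for a model and a batch (purely illustrative).
model = nn.Linear(16, 16)
x = torch.randn(8, 16)
y = torch.randn(8, 16)

params = [p for p in model.parameters() if p.requires_grad]
loss = nn.functional.mse_loss(model(x), y)

# First backward pass, keeping the graph so we can differentiate again.
grads = torch.autograd.grad(loss, params, create_graph=True)

# An arbitrary vector v with the same shapes as the parameters.
v = [torch.randn_like(p) for p in params]

# HVP: differentiate <grad, v> w.r.t. the parameters. This costs roughly
# one extra backward pass instead of forming the full Hessian.
grad_dot_v = sum((g * vi).sum() for g, vi in zip(grads, v))
hvp = torch.autograd.grad(grad_dot_v, params)

print([h.shape for h in hvp])
```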