Farzi Data: Autoregressive Data Distillation

Abstract

We study data distillation for auto-regressive machine learning tasks, where the input and output have a strict left-to-right causal structure. More specifically, we propose Farzi, which summarizes an event sequence dataset into a small number of synthetic sequences (Farzi Data) that are optimized to maintain, if not improve, model performance compared to training on the full dataset. Under the hood, Farzi conducts memory-efficient data distillation by (i) deriving efficient reverse-mode differentiation of the Adam optimizer by leveraging Hessian-Vector Products; and (ii) factorizing the high-dimensional discrete event space into a latent space, which provably promotes implicit regularization. Empirically, for sequential recommendation and language modeling tasks, we are able to achieve 98-120% of downstream full-data performance when training state-of-the-art models on Farzi Data of size as little as 0.1% of the original dataset. Notably, being able to train better models with significantly less data sheds light on the design of future large auto-regressive models, and opens up new opportunities to further scale up model and data sizes.
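To make point (ii) concrete, the following is a minimal sketch of why a latent factorization saves memory. It assumes, purely for illustration, that each synthetic event is stored as a short latent vector and that a single shared decoder matrix maps latent vectors to scores over the discrete event vocabulary. The function names, the shapes, and the sizes are hypothetical and are not taken from the abstract; the sketch shows only the storage arithmetic, not the regularization result.

```python
def dense_size(num_seqs, seq_len, vocab_size):
    """Values needed when every event stores a full vector over the vocabulary."""
    return num_seqs * seq_len * vocab_size


def factorized_size(num_seqs, seq_len, vocab_size, latent_dim):
    """Values needed for a latent summary plus one shared decoder matrix."""
    latent = num_seqs * seq_len * latent_dim   # one short vector per event
    decoder = latent_dim * vocab_size          # shared across all events
    return latent + decoder


def decode(latent_seq, decoder):
    """Map each latent event vector to vocabulary scores (plain matrix product)."""
    vocab_size = len(decoder[0])
    return [
        [sum(z * row[j] for z, row in zip(event, decoder)) for j in range(vocab_size)]
        for event in latent_seq
    ]


# Hypothetical sizes, chosen only for illustration.
dense = dense_size(num_seqs=100, seq_len=50, vocab_size=10_000)
small = factorized_size(num_seqs=100, seq_len=50, vocab_size=10_000, latent_dim=8)
print(dense, small)  # 50000000 120000

# One sequence with a single event, latent_dim=2, vocab_size=3.
scores = decode([[1.0, 2.0]], [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
print(scores)  # [[1.0, 2.0, 3.0]]
```

Under these assumed sizes the factorized form stores 120,000 values instead of 50,000,000, because the vocabulary dimension appears only once, in the shared decoder.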
