Farzi Data: Autoregressive Data Distillation

Abstract

We study data distillation for auto-regressive machine learning tasks, where the input and output have a strict left-to-right causal structure. Specifically, we propose Farzi, which summarizes an event sequence dataset into a small number of synthetic sequences (Farzi Data) that are optimized to maintain, if not improve, model performance compared to training on the full dataset. Under the hood, Farzi conducts memory-efficient data distillation by (i) deriving efficient reverse-mode differentiation of the Adam optimizer by leveraging Hessian-Vector Products; and (ii) factorizing the high-dimensional discrete event space into a latent space, which provably promotes implicit regularization. Empirically, for sequential recommendation and language modeling tasks, we achieve 98-120% of downstream full-data performance when training state-of-the-art models on Farzi Data as small as 0.1% of the original dataset. Notably, being able to train better models with significantly less data sheds light on the design of future large auto-regressive models and opens up new opportunities to further scale up model and data sizes.
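To make point (ii) more concrete, here is a minimal sketch of what factorizing a discrete event space into a latent space could look like. The abstract does not give the exact parameterization, so everything here is an assumption made for illustration: the latent summary times decoder matrix form, the softmax with a temperature, the toy sizes, and all names (`latent`, `decoder`, `decode`). The sketch only shows why a factored summary needs far fewer free parameters than a dense distribution over every event at every position.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def decode(latent, decoder, temperature=1.0):
    """Map latent[seq][pos][d] through decoder[d][vocab] to token distributions.

    Hypothetical parameterization: each position of each synthetic sequence
    becomes a probability distribution over the event vocabulary.
    """
    vocab_size = len(decoder[0])
    decoded = []
    for sequence in latent:
        rows = []
        for vec in sequence:
            logits = [
                sum(vec[k] * decoder[k][j] for k in range(len(vec)))
                for j in range(vocab_size)
            ]
            rows.append(softmax(logits, temperature))
        decoded.append(rows)
    return decoded


# Toy sizes chosen for illustration only (not taken from the paper).
NUM_SEQS, SEQ_LEN, LATENT_DIM, VOCAB = 4, 5, 3, 50

rng = random.Random(0)
latent = [
    [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)] for _ in range(SEQ_LEN)]
    for _ in range(NUM_SEQS)
]
decoder = [[rng.gauss(0.0, 1.0) for _ in range(VOCAB)] for _ in range(LATENT_DIM)]

synthetic = decode(latent, decoder)

# Free parameters: dense per-position distributions vs. the factored form.
dense_params = NUM_SEQS * SEQ_LEN * VOCAB
factored_params = NUM_SEQS * SEQ_LEN * LATENT_DIM + LATENT_DIM * VOCAB
print(dense_params, factored_params)
```

With these toy sizes the dense form has 1000 free parameters and the factored form has 210. The gap widens as the vocabulary grows, because the vocabulary dimension appears only in the shared decoder.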
