TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training

Abstract

The data mixture used to pre-train a language model is a cornerstone of its final performance. However, a static mixing strategy is suboptimal, because the model's learning preferences for different data domains shift dynamically throughout training. Observing these evolving preferences in a computationally efficient manner remains a significant challenge. To address this, we propose TiKMiX, a method that dynamically adjusts the data mixture according to the model's evolving preferences. TiKMiX introduces Group Influence, an efficient metric for evaluating the impact of data domains on the model. This metric lets us formulate data mixing as a search for an optimal, influence-maximizing distribution. We solve this problem with two approaches: TiKMiX-D, which optimizes the mixture directly, and TiKMiX-M, which uses a regression model to predict a superior mixture. We trained models with different numbers of parameters on up to 1 trillion tokens. TiKMiX-D exceeds the performance of state-of-the-art methods such as REGMIX while using just 20% of the computational resources. TiKMiX-M yields an average performance gain of 2% across 9 downstream benchmarks. Our experiments show that a model's data preferences evolve with training progress and scale, and that dynamically adjusting the data mixture based on Group Influence, a direct measure of these preferences, significantly improves performance by mitigating the underdigestion of data seen with static ratios.
