Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
October 10, 2024
Authors: Tianyi Bai, Ling Yang, Zhen Hao Wong, Jiahui Peng, Xinlin Zhuang, Chi Zhang, Lijun Wu, Qiu Jiantao, Wentao Zhang, Binhang Yuan, Conghui He
cs.AI
Abstract
Efficient data selection is crucial to accelerate the pretraining of large
language models (LLMs). While various methods have been proposed to enhance
data efficiency, limited research has addressed the inherent conflicts between
these approaches to achieve optimal data selection for LLM pretraining. To
tackle this problem, we propose a novel multi-agent collaborative data
selection mechanism. In this framework, each data selection method serves as an
independent agent, and an agent console is designed to dynamically integrate
the information from all agents throughout the LLM training process. We conduct
extensive empirical studies to evaluate our multi-agent framework. The
experimental results demonstrate that our approach significantly improves data
efficiency, accelerates convergence in LLM training, and achieves an average
performance gain of 10.5% across multiple language model benchmarks compared to
the state-of-the-art methods.
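To make the framework concrete, the sketch below illustrates the core idea described in the abstract: each data selection method acts as an independent scoring agent, and a console combines their scores with weights it adapts during training. The agent functions, class names, and the multiplicative weight-update rule are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of multi-agent collaborative data selection.
# Each "agent" is a data-scoring heuristic; the console integrates
# their scores and re-weights them from training feedback.
# All names and the update rule are assumptions for illustration.

def quality_agent(sample: str) -> float:
    # Toy proxy for a quality scorer: longer samples score higher, capped at 1.
    return min(len(sample) / 100.0, 1.0)

def diversity_agent(sample: str) -> float:
    # Toy proxy for a diversity scorer: fraction of unique words.
    words = sample.split()
    return len(set(words)) / len(words) if words else 0.0

class AgentConsole:
    """Combines agent scores with weights updated from training feedback."""

    def __init__(self, agents):
        self.agents = agents
        self.weights = [1.0 / len(agents)] * len(agents)

    def score(self, sample: str) -> float:
        # Weighted sum of all agents' scores for one candidate sample.
        return sum(w * a(sample) for w, a in zip(self.weights, self.agents))

    def update(self, rewards):
        # Multiplicative-weights style update: agents whose selections
        # correlated with better training signal gain influence.
        self.weights = [w * (1.0 + r) for w, r in zip(self.weights, rewards)]
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]

console = AgentConsole([quality_agent, diversity_agent])
pool = [
    "short text",
    "a longer passage with many distinct words in it",
    "aa aa aa",
]
# Pick the top-scoring sample for the next training batch.
best = max(pool, key=console.score)
# Pretend feedback from the training run: the quality agent helped more.
console.update([0.2, 0.1])
```

In this toy setup the console would pick the second pool entry (long and fully unique words), and after the update the quality agent's weight exceeds the diversity agent's while the weights still sum to one. The dynamic re-weighting is what lets the console arbitrate between otherwise conflicting selection criteria.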