Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
October 10, 2024
Authors: Tianyi Bai, Ling Yang, Zhen Hao Wong, Jiahui Peng, Xinlin Zhuang, Chi Zhang, Lijun Wu, Qiu Jiantao, Wentao Zhang, Binhang Yuan, Conghui He
cs.AI
Abstract
Efficient data selection is crucial to accelerate the pretraining of large
language models (LLMs). While various methods have been proposed to enhance
data efficiency, limited research has addressed the inherent conflicts between
these approaches to achieve optimal data selection for LLM pretraining. To
tackle this problem, we propose a novel multi-agent collaborative data
selection mechanism. In this framework, each data selection method serves as an
independent agent, and an agent console is designed to dynamically integrate
the information from all agents throughout the LLM training process. We conduct
extensive empirical studies to evaluate our multi-agent framework. The
experimental results demonstrate that our approach significantly improves data
efficiency, accelerates convergence in LLM training, and achieves an average
performance gain of 10.5% across multiple language model benchmarks compared to
the state-of-the-art methods.
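The abstract's core idea — independent data-selection methods acting as agents whose signals an "agent console" dynamically integrates during training — can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual algorithm: the `AgentConsole` class, the toy scoring agents, and the weight-update rule are all assumptions introduced for exposition.

```python
# Hypothetical sketch: each "agent" is an independent data-selection
# heuristic that scores a candidate sample; a console combines the scores
# with per-agent weights that can be updated from training feedback.
from typing import Callable, Dict, List


class AgentConsole:
    def __init__(self, agents: Dict[str, Callable[[str], float]]):
        self.agents = agents
        # Start from equal weights; in the paper's framework the
        # integration is dynamic, which we caricature via update_weights.
        self.weights = {name: 1.0 / len(agents) for name in agents}

    def score(self, sample: str) -> float:
        # Weighted combination of the individual agents' scores.
        return sum(self.weights[n] * f(sample) for n, f in self.agents.items())

    def update_weights(self, feedback: Dict[str, float]) -> None:
        # Re-weight agents by an (assumed) per-agent reward, then renormalize
        # so the weights stay a convex combination.
        for name, reward in feedback.items():
            self.weights[name] *= max(reward, 1e-8)
        total = sum(self.weights.values())
        self.weights = {n: w / total for n, w in self.weights.items()}


# Toy agents: a length-based quality proxy and a keyword-based domain check.
agents = {
    "quality": lambda s: min(len(s.split()) / 50.0, 1.0),
    "domain": lambda s: 1.0 if "model" in s.lower() else 0.2,
}
console = AgentConsole(agents)
batch: List[str] = ["Large language models learn from text.", "hi"]
# Rank candidate samples by the integrated score, highest first.
selected = sorted(batch, key=console.score, reverse=True)
```

In this toy setup, the longer in-domain sentence outranks the short off-domain one; a real system would replace the lambdas with the actual selection methods (e.g., quality classifiers or influence scores) and drive the weight updates from observed training signals.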