Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
October 10, 2024
Authors: Tianyi Bai, Ling Yang, Zhen Hao Wong, Jiahui Peng, Xinlin Zhuang, Chi Zhang, Lijun Wu, Qiu Jiantao, Wentao Zhang, Binhang Yuan, Conghui He
cs.AI
Abstract
Efficient data selection is crucial to accelerate the pretraining of large
language models (LLMs). While various methods have been proposed to enhance
data efficiency, limited research has addressed the inherent conflicts between
these approaches to achieve optimal data selection for LLM pretraining. To
tackle this problem, we propose a novel multi-agent collaborative data
selection mechanism. In this framework, each data selection method serves as an
independent agent, and an agent console is designed to dynamically integrate
the information from all agents throughout the LLM training process. We conduct
extensive empirical studies to evaluate our multi-agent framework. The
experimental results demonstrate that our approach significantly improves data
efficiency, accelerates convergence in LLM training, and achieves an average
performance gain of 10.5% across multiple language model benchmarks compared to
the state-of-the-art methods.
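To make the framework concrete, the sketch below illustrates the core idea described in the abstract: each data selection method acts as an independent scoring agent, and a console combines their scores with weights it adapts during training. The agent functions, class names, and the multiplicative weight-update rule are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of multi-agent collaborative data selection.
# Each "agent" is a data-scoring heuristic; the console integrates
# their scores and re-weights them from training feedback.
# All names and the update rule are assumptions for illustration.

def quality_agent(sample: str) -> float:
    # Toy proxy for a quality scorer: longer samples score higher, capped at 1.
    return min(len(sample) / 100.0, 1.0)

def diversity_agent(sample: str) -> float:
    # Toy proxy for a diversity scorer: fraction of unique words.
    words = sample.split()
    return len(set(words)) / len(words) if words else 0.0

class AgentConsole:
    """Combines agent scores with weights updated from training feedback."""

    def __init__(self, agents):
        self.agents = agents
        self.weights = [1.0 / len(agents)] * len(agents)

    def score(self, sample: str) -> float:
        # Weighted sum of all agents' scores for one candidate sample.
        return sum(w * a(sample) for w, a in zip(self.weights, self.agents))

    def update(self, rewards):
        # Multiplicative-weights style update: agents whose selections
        # correlated with better training signal gain influence.
        self.weights = [w * (1.0 + r) for w, r in zip(self.weights, rewards)]
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]

console = AgentConsole([quality_agent, diversity_agent])
pool = [
    "short text",
    "a longer passage with many distinct words in it",
    "aa aa aa",
]
# Pick the top-scoring sample for the next training batch.
best = max(pool, key=console.score)
# Pretend feedback from the training run: the quality agent helped more.
console.update([0.2, 0.1])
```

In this toy setup the console would pick the second pool entry (long and fully unique words), and after the update the quality agent's weight exceeds the diversity agent's while the weights still sum to one. The dynamic re-weighting is what lets the console arbitrate between otherwise conflicting selection criteria.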