

MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

April 18, 2025
作者: Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, Kai Chen
cs.AI

Abstract

Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.

