Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
February 11, 2026
Authors: Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu
cs.AI
Abstract
The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC), which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.
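To make the coverage idea concrete, the sketch below shows one way a metric like FAC could be computed from a sparse autoencoder (SAE) over a dataset's hidden states. It is a minimal illustration under stated assumptions, not the paper's released implementation: the ReLU encoder form, the activation threshold, and the function names are all illustrative choices.

```python
# Illustrative sketch of a Feature Activation Coverage (FAC)-style metric.
# Assumptions (not from the paper's code): one hidden-state vector per example,
# a standard ReLU SAE encoder, and a feature counts as "covered" if it fires
# above a small threshold on at least one example in the seed dataset.
import numpy as np

def feature_activations(hidden_states: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray) -> np.ndarray:
    """Encode LLM hidden states into sparse feature activations (ReLU SAE encoder)."""
    return np.maximum(hidden_states @ W_enc + b_enc, 0.0)

def fac(dataset_hidden: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray, threshold: float = 1e-3):
    """Return (coverage, missing_feature_ids) for a seed dataset.

    dataset_hidden: (num_examples, d_model) hidden states, one vector per example
    W_enc, b_enc:   SAE encoder weight (d_model, num_features) and bias (num_features,)
    """
    acts = feature_activations(dataset_hidden, W_enc, b_enc)  # (num_examples, num_features)
    covered = (acts > threshold).any(axis=0)                   # which features ever fire on the dataset
    coverage = covered.mean()                                  # fraction of SAE features covered
    missing = np.flatnonzero(~covered)                         # feature ids absent from the seed set
    return coverage, missing

# In a FAC Synthesis-style pipeline, the missing feature ids would then be
# mapped to their natural-language interpretations and used to prompt an LLM
# to generate synthetic examples that explicitly exercise those features.
```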