CodecLM: Aligning Language Models with Tailored Synthetic Data
April 8, 2024
Authors: Zifeng Wang, Chun-Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, Tomas Pfister
cs.AI
Abstract
Instruction tuning has emerged as the key in aligning large language models
(LLMs) with specific task instructions, thereby mitigating the discrepancy
between the next-token prediction objective and users' actual goals. To reduce
the labor and time cost of collecting or annotating data by humans, researchers
have started to explore the use of LLMs to generate instruction-aligned synthetic
data. Recent works focus on generating diverse instructions and applying an LLM to
increase instruction complexity, often neglecting downstream use cases. It
remains unclear how to tailor high-quality data to elicit better
instruction-following abilities in different target instruction distributions
and LLMs. To this end, we introduce CodecLM, a general framework for adaptively
generating high-quality synthetic data for LLM alignment with different
downstream instruction distributions and LLMs. Drawing on the Encode-Decode
principles, we use LLMs as codecs to guide the data generation process. We
first encode seed instructions into metadata, which are concise keywords
generated on-the-fly to capture the target instruction distribution, and then
decode metadata to create tailored instructions. We also introduce Self-Rubrics
and Contrastive Filtering during decoding to tailor data-efficient samples.
Extensive experiments on four open-domain instruction following benchmarks
validate the effectiveness of CodecLM over the current state of the art.
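The encode-decode pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`llm`, `encode`, `decode`, `contrastive_filter`) and the keyword-score thresholds are all hypothetical, and `llm` is a canned stub so the example runs without any model API.

```python
def llm(prompt: str) -> str:
    """Stub standing in for a real LLM call; returns canned text."""
    if prompt.startswith("Extract"):
        return "use case: coding; skills: recursion, unit testing"
    if prompt.startswith("Generate"):
        return "Write a recursive function and unit tests covering its edge cases."
    return "OK"

def encode(seed_instruction: str) -> str:
    """Encode a seed instruction into concise metadata (keywords
    capturing the target instruction distribution)."""
    return llm(f"Extract the use case and skills from: {seed_instruction}")

def decode(metadata: str) -> str:
    """Decode metadata into a new, tailored instruction."""
    return llm(f"Generate one instruction matching this metadata: {metadata}")

def contrastive_filter(samples, gap=2):
    """Contrastive Filtering sketch: keep (instruction, strong_score,
    target_score) triples where the strong LLM clearly beats the target
    LLM, i.e. samples the target model has the most to learn from."""
    return [s for s in samples if s[1] - s[2] >= gap]

metadata = encode("Implement quicksort in Python and test it.")
tailored = decode(metadata)
kept = contrastive_filter([(tailored, 9, 5), ("trivial instruction", 7, 7)])
print(metadata)   # concise keywords
print(tailored)   # tailored instruction decoded from the keywords
print(len(kept))  # only the high-gap sample survives filtering
```

In the actual framework, the scores in `contrastive_filter` would come from the paper's Self-Rubrics step (LLM-generated scoring criteria), and `decode` would iteratively increase instruction complexity rather than emit a single rewrite.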