ChatPaper.ai

CodecLM: Aligning Language Models with Tailored Synthetic Data

April 8, 2024
作者: Zifeng Wang, Chun-Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, Tomas Pfister
cs.AI

Abstract

Instruction tuning has emerged as the key in aligning large language models (LLMs) with specific task instructions, thereby mitigating the discrepancy between the next-token prediction objective and users' actual goals. To reduce the labor and time cost to collect or annotate data by humans, researchers start to explore the use of LLMs to generate instruction-aligned synthetic data. Recent works focus on generating diverse instructions and applying LLM to increase instruction complexity, often neglecting downstream use cases. It remains unclear how to tailor high-quality data to elicit better instruction-following abilities in different target instruction distributions and LLMs. To this end, we introduce CodecLM, a general framework for adaptively generating high-quality synthetic data for LLM alignment with different downstream instruction distributions and LLMs. Drawing on the Encode-Decode principles, we use LLMs as codecs to guide the data generation process. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution, and then decode metadata to create tailored instructions. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples. Extensive experiments on four open-domain instruction following benchmarks validate the effectiveness of CodecLM over the current state-of-the-arts.
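The abstract's pipeline can be sketched as code: encode seed instructions into concise metadata, decode the metadata into tailored instructions, sharpen them with Self-Rubrics, and keep only data-efficient samples via Contrastive Filtering. This is a minimal illustrative sketch, not the paper's implementation: the `llm` stub, the prompts, and the quality-gap threshold are all assumptions standing in for real strong-LLM calls and the paper's scoring scheme.

```python
def llm(prompt: str) -> str:
    """Stand-in for a strong-LLM completion call; returns a stub string."""
    return f"[LLM output for: {prompt}]"

def encode(seed: str) -> dict:
    # Encode: distill the seed instruction into concise metadata
    # (target use case + required skills), generated on the fly.
    return {
        "use_case": llm(f"Identify the use case of: {seed}"),
        "skills": llm(f"List the skills required by: {seed}"),
    }

def decode(meta: dict) -> str:
    # Decode: synthesize a new instruction tailored to the metadata.
    return llm(f"Write an instruction for {meta['use_case']} "
               f"exercising {meta['skills']}")

def self_rubrics(instruction: str) -> str:
    # Self-Rubrics: ask the LLM for rubrics suited to this instruction,
    # then use them to raise its complexity along relevant axes.
    rubrics = llm(f"Propose rubrics to make this harder: {instruction}")
    return llm(f"Rewrite per {rubrics}: {instruction}")

def contrastive_filter(strong_score: float, target_score: float,
                       gap: float = 0.3) -> bool:
    # Contrastive Filtering: keep a sample only when the target model
    # trails the strong model by a margin, i.e. the sample still has
    # something to teach the target model. The 0.3 gap is illustrative.
    return strong_score - target_score >= gap

def generate(seeds: list[str]) -> list[str]:
    batch = []
    for seed in seeds:
        instr = self_rubrics(decode(encode(seed)))
        # In the real pipeline both models' responses would be scored
        # by the strong LLM; fixed numbers keep this sketch runnable.
        if contrastive_filter(strong_score=0.9, target_score=0.4):
            batch.append(instr)
    return batch
```

Note how the filtering step makes the data generation adaptive: instructions the target model already handles as well as the strong model are discarded, so the synthetic set concentrates on the target's weaknesses.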

