CodecLM: Aligning Language Models with Tailored Synthetic Data

Abstract

Instruction tuning has emerged as key to aligning large language models (LLMs) with specific task instructions, thereby mitigating the discrepancy between the next-token prediction objective and users' actual goals. To reduce the labor and time cost of collecting or annotating data by hand, researchers have started to explore using LLMs to generate instruction-aligned synthetic data. Recent work focuses on generating diverse instructions and applying LLMs to increase instruction complexity, often neglecting downstream use cases. It remains unclear how to tailor high-quality data to elicit better instruction-following abilities for different target instruction distributions and LLMs. To this end, we introduce CodecLM, a general framework for adaptively generating high-quality synthetic data for LLM alignment with different downstream instruction distributions and LLMs. Drawing on the encode-decode principle, we use LLMs as codecs to guide the data generation process. We first encode seed instructions into metadata, which are concise keywords generated on the fly to capture the target instruction distribution, and then decode the metadata to create tailored instructions. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples. Extensive experiments on four open-domain instruction-following benchmarks validate the effectiveness of CodecLM over the current state of the art.

