IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

Abstract

In the realm of large language models (LLMs), the ability to follow instructions accurately is paramount: as more agents and applications are built on LLMs, the complexity of instructions is increasing rapidly. However, evaluation data for complex instructions remains limited, and no dedicated algorithms exist for improving the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating complex instruction-following ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method, which takes both input and output preference pairs into consideration, so that LLMs not only rapidly align with response preferences but also meticulously explore instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO: compared to SFT and DPO, it improves by 8.15% and 2.18%, respectively, on in-domain data and by 6.29% and 3.13%, respectively, on out-of-domain data.
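For context on the DPO baseline named in the abstract, the following is a minimal sketch of the standard DPO loss for a single output preference pair (one preferred and one rejected response to the same instruction), written in plain Python. It is not IOPO: the abstract does not state the IOPO objective, so the input-preference side is deliberately left out. The function names `softplus` and `dpo_loss`, the parameter names, and the default `beta=0.1` are illustrative assumptions, not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; a tunable hyperparameter
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response
    under either the policy being trained or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)
```

The loss falls below log 2 when the policy favors the preferred response more strongly than the reference model does, and rises above it in the opposite case. This objective compares only outputs for a fixed input; according to the abstract, IOPO additionally takes input (instruction) preference pairs into account.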
