Reverse Preference Optimization for Complex Instruction Following

Abstract

Instruction following (IF) is a critical capability for large language models (LLMs). However, handling complex instructions with multiple constraints remains challenging. Previous methods typically select preference pairs based on the number of constraints each response satisfies, which introduces noise: a chosen example may still violate some constraints, and a rejected example may outperform the chosen one in certain respects. To address the challenge of aligning with multiple preferences, we propose a simple yet effective method called Reverse Preference Optimization (RPO). RPO mitigates noise in preference pairs by dynamically reversing the constraints within the instruction so that the chosen response is perfect, which alleviates the burden of extensive sampling and filtering to collect perfect responses. Reversal also enlarges the gap between chosen and rejected responses, thereby clarifying the optimization direction and making it more robust to noise. We evaluate RPO on two multi-turn IF benchmarks, Sysbench and Multi-IF, where it improves over the DPO baseline by an average of 4.6 and 2.5 points, respectively (on Llama-3.1 8B). Moreover, RPO scales effectively across model sizes (8B to 70B parameters), with the 70B RPO model surpassing GPT-4o.
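The reversal idea can be illustrated with a toy example. The sketch below is not the paper's implementation: the `Constraint` class, `count_based_pair`, `reverse_for`, and the three sample constraints are hypothetical names and data introduced only for illustration. The abstract does not say how the rejected response is constructed, so the sketch simply re-scores the other sampled response against the reversed instruction.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Constraint:
    """Hypothetical checkable constraint with its stated and reversed wording."""
    text: str                      # wording as it appears in the instruction
    reversed_text: str             # wording of the opposite constraint
    check: Callable[[str], bool]   # True if a response satisfies `text`


def satisfied(response: str, constraints: List[Constraint]) -> List[bool]:
    """Per-constraint satisfaction flags for one response."""
    return [c.check(response) for c in constraints]


def count_based_pair(a: str, b: str, constraints: List[Constraint]) -> Tuple[str, str]:
    """Count-based pairing: the response meeting more constraints is chosen."""
    score_a = sum(satisfied(a, constraints))
    score_b = sum(satisfied(b, constraints))
    return (a, b) if score_a >= score_b else (b, a)


def reverse_for(response: str, constraints: List[Constraint]) -> List[Constraint]:
    """Flip every constraint the response violates, so it satisfies all of them."""
    rewritten = []
    for c in constraints:
        if c.check(response):
            rewritten.append(c)
        else:
            # Swap the wording and negate the check (bind `c` via default arg).
            rewritten.append(
                Constraint(c.reversed_text, c.text, lambda r, c=c: not c.check(r))
            )
    return rewritten


# Toy instruction with three constraints (illustrative only).
constraints = [
    Constraint("Use at most 5 words.", "Use more than 5 words.",
               lambda r: len(r.split()) <= 5),
    Constraint("Mention 'Paris'.", "Do not mention 'Paris'.",
               lambda r: "Paris" in r),
    Constraint("Write in all lowercase.", "Do not write in all lowercase.",
               lambda r: r == r.lower()),
]

resp_a = "The capital is Paris."
resp_b = "the capital of france is a large city"

# Count-based selection: resp_a is chosen (2 of 3 satisfied) over resp_b (1 of 3),
# yet resp_a violates the lowercase constraint that resp_b satisfies.
chosen, rejected = count_based_pair(resp_a, resp_b, constraints)

# After reversal, resp_a satisfies 3 of 3 and resp_b satisfies 0 of 3.
reversed_constraints = reverse_for(chosen, constraints)
new_instruction = " ".join(c.text for c in reversed_constraints)
```

In this toy case the count-based pair has a margin of 2 versus 1 with a conflicting signal on the lowercase constraint, while the reversed instruction yields a margin of 3 versus 0 with no conflict, which is the kind of wider gap and cleaner optimization direction the abstract describes.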

