Reverse Preference Optimization for Complex Instruction Following
May 28, 2025
Authors: Xiang Huang, Ting-En Lin, Feiteng Fang, Yuchuan Wu, Hangyu Li, Yuzhong Qu, Fei Huang, Yongbin Li
cs.AI
Abstract
Instruction following (IF) is a critical capability for large language models
(LLMs). However, handling complex instructions with multiple constraints
remains challenging. Previous methods typically select preference pairs based
on the number of constraints they satisfy, which introduces noise: chosen
examples may fail to follow some constraints, and rejected examples may
outperform the chosen ones in certain respects. To address the challenge of aligning
with multiple preferences, we propose a simple yet effective method called
Reverse Preference Optimization (RPO). It mitigates noise in preference pairs
by dynamically reversing the constraints within the instruction to ensure the
chosen response is perfect, alleviating the burden of extensive sampling and
filtering to collect perfect responses. In addition, reversal enlarges the gap
between chosen and rejected responses, thereby clarifying the optimization
direction and making it more robust to noise. We evaluate RPO on two multi-turn
IF benchmarks, SysBench and Multi-IF, demonstrating average improvements over
the DPO baseline of 4.6 and 2.5 points (on Llama-3.1 8B), respectively.
Moreover, RPO scales effectively across model sizes (8B to 70B parameters),
with the 70B RPO model surpassing GPT-4o.Summary
AI-Generated Summary
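The core idea, rewriting the constraints that a sampled response violates so that the response becomes a perfect "chosen" answer for the rewritten instruction, can be illustrated with a minimal sketch. The checker, the reversal template, and the pair structure below are illustrative assumptions rather than the paper's implementation; the resulting pairs would then be passed to a standard DPO-style trainer.

```python
# Minimal sketch of RPO-style preference-pair construction (not the authors' code).
# Assumptions: `check(constraint, response)` is a hypothetical per-constraint verifier,
# and `reverse(constraint)` is a hypothetical template that negates a constraint
# (e.g. "Use bullet points." -> "Do NOT follow this requirement: Use bullet points.").

from dataclasses import dataclass


@dataclass
class PreferencePair:
    instruction: str  # instruction with (possibly reversed) constraints
    chosen: str       # response that now satisfies every constraint
    rejected: str     # response used as the negative example


def reverse(constraint: str) -> str:
    # Hypothetical reversal template; real reversals would be constraint-specific.
    return f"Do NOT follow this requirement: {constraint}"


def build_rpo_pair(base_instruction: str,
                   constraints: list[str],
                   chosen: str,
                   rejected: str,
                   check) -> PreferencePair:
    """Reverse every constraint the chosen response violates, so the chosen
    response is perfect with respect to the rewritten instruction."""
    rewritten = []
    for c in constraints:
        if check(c, chosen):           # constraint already satisfied -> keep it
            rewritten.append(c)
        else:                          # violated -> reverse it dynamically
            rewritten.append(reverse(c))
    instruction = base_instruction + "\n" + "\n".join(f"- {c}" for c in rewritten)
    return PreferencePair(instruction=instruction, chosen=chosen, rejected=rejected)


if __name__ == "__main__":
    # Toy keyword-based checker, purely illustrative.
    toy_check = lambda c, r: ("bullet" not in c) or r.strip().startswith("-")
    pair = build_rpo_pair(
        "Summarize the article.",
        ["Use bullet points.", "Keep it under 50 words."],
        chosen="The article argues that ...",
        rejected="- A rambling, off-topic list ...",
        check=toy_check,
    )
    print(pair.instruction)
```

Because violated constraints are reversed rather than filtered out, every chosen response is "perfect" by construction, which is what enlarges the chosen-rejected gap described in the abstract.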