
Reverse Preference Optimization for Complex Instruction Following

May 28, 2025
Authors: Xiang Huang, Ting-En Lin, Feiteng Fang, Yuchuan Wu, Hangyu Li, Yuzhong Qu, Fei Huang, Yongbin Li
cs.AI

Abstract

Instruction following (IF) is a critical capability for large language models (LLMs). However, handling complex instructions with multiple constraints remains challenging. Previous methods typically select preference pairs based on the number of constraints they satisfy, introducing noise where chosen examples may fail to follow some constraints and rejected examples may excel in certain respects over the chosen ones. To address the challenge of aligning with multiple preferences, we propose a simple yet effective method called Reverse Preference Optimization (RPO). It mitigates noise in preference pairs by dynamically reversing the constraints within the instruction to ensure the chosen response is perfect, alleviating the burden of extensive sampling and filtering to collect perfect responses. Besides, reversal also enlarges the gap between chosen and rejected responses, thereby clarifying the optimization direction and making it more robust to noise. We evaluate RPO on two multi-turn IF benchmarks, Sysbench and Multi-IF, demonstrating average improvements over the DPO baseline of 4.6 and 2.5 points (on Llama-3.1 8B), respectively. Moreover, RPO scales effectively across model sizes (8B to 70B parameters), with the 70B RPO model surpassing GPT-4o.
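
The core mechanism described in the abstract can be pictured with a minimal sketch (Python); this is not the authors' released code. For a sampled response, the constraints it fails are rewritten into their opposites, so that the very same response becomes a flawless "chosen" answer to the rewritten instruction. The helper names (`reverse_constraint`, `satisfies`, `build_chosen_instruction`), the reversal template, and the toy checker below are assumptions made purely for illustration; the paper's actual rewriting templates and pair-construction scheme may differ.

```python
# Minimal illustrative sketch of the constraint-reversal idea from the
# abstract (hypothetical helpers, not the authors' implementation).

def reverse_constraint(constraint: str) -> str:
    """Hypothetical reversal template: rewrite a constraint into its
    negation, so a response that violates it now satisfies the rewrite."""
    return f"Do NOT {constraint[0].lower() + constraint[1:]}"

def build_chosen_instruction(constraints: list[str],
                             response: str,
                             satisfies) -> list[str]:
    """Reverse exactly the constraints the sampled response fails, so the
    response perfectly follows the rewritten constraint list and can be
    used as a noise-free 'chosen' example."""
    return [c if satisfies(response, c) else reverse_constraint(c)
            for c in constraints]

# Toy usage: the response writes four sentences where three were required.
constraints = ["Answer in exactly three sentences.", "Use a formal tone."]
response = "Sentence one. Sentence two. Sentence three. Sentence four."

def satisfies(resp: str, constraint: str) -> bool:
    # Toy checker: only verifies the sentence-count constraint.
    if "three sentences" in constraint:
        return resp.count(".") == 3
    return True  # assume other constraints hold in this example

print(build_chosen_instruction(constraints, response, satisfies))
# -> ['Do NOT answer in exactly three sentences.', 'Use a formal tone.']
```

Only the violated constraint is rewritten, so no sampling or filtering for a perfect response is needed, which matches the abstract's claim that reversal removes noise from the chosen side of each preference pair.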

