Reverse Preference Optimization for Complex Instruction Following
May 28, 2025
Authors: Xiang Huang, Ting-En Lin, Feiteng Fang, Yuchuan Wu, Hangyu Li, Yuzhong Qu, Fei Huang, Yongbin Li
cs.AI
Abstract
Instruction following (IF) is a critical capability for large language models
(LLMs). However, handling complex instructions with multiple constraints
remains challenging. Previous methods typically select preference pairs based
on the number of constraints they satisfy, which introduces noise: chosen
examples may fail to follow some constraints, and rejected examples may
outperform the chosen ones in certain respects. To address the challenge of aligning
with multiple preferences, we propose a simple yet effective method called
Reverse Preference Optimization (RPO). It mitigates noise in preference pairs
by dynamically reversing the constraints within the instruction to ensure the
chosen response is perfect, alleviating the burden of extensive sampling and
filtering to collect perfect responses. In addition, reversal enlarges the gap
between chosen and rejected responses, thereby clarifying the optimization
direction and making it more robust to noise. We evaluate RPO on two multi-turn
IF benchmarks, SysBench and Multi-IF, demonstrating average improvements over
the DPO baseline of 4.6 and 2.5 points (on Llama-3.1 8B), respectively.
Moreover, RPO scales effectively across model sizes (8B to 70B parameters),
with the 70B RPO model surpassing GPT-4o.Summary
AI-Generated Summary
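The core idea, rewriting the constraints that a sampled response violates so that the response becomes a perfect "chosen" answer for the rewritten instruction, can be illustrated with a minimal sketch. The checker, the reversal template, and the pair structure below are illustrative assumptions rather than the paper's implementation; the resulting pairs would then be passed to a standard DPO-style trainer.

```python
# Minimal sketch of RPO-style preference-pair construction (not the authors' code).
# Assumptions: `check(constraint, response)` is a hypothetical per-constraint verifier,
# and `reverse(constraint)` is a hypothetical template that negates a constraint
# (e.g. "Use bullet points." -> "Do NOT follow this requirement: Use bullet points.").

from dataclasses import dataclass


@dataclass
class PreferencePair:
    instruction: str  # instruction with (possibly reversed) constraints
    chosen: str       # response that now satisfies every constraint
    rejected: str     # response used as the negative example


def reverse(constraint: str) -> str:
    # Hypothetical reversal template; real reversals would be constraint-specific.
    return f"Do NOT follow this requirement: {constraint}"


def build_rpo_pair(base_instruction: str,
                   constraints: list[str],
                   chosen: str,
                   rejected: str,
                   check) -> PreferencePair:
    """Reverse every constraint the chosen response violates, so the chosen
    response is perfect with respect to the rewritten instruction."""
    rewritten = []
    for c in constraints:
        if check(c, chosen):           # constraint already satisfied -> keep it
            rewritten.append(c)
        else:                          # violated -> reverse it dynamically
            rewritten.append(reverse(c))
    instruction = base_instruction + "\n" + "\n".join(f"- {c}" for c in rewritten)
    return PreferencePair(instruction=instruction, chosen=chosen, rejected=rejected)


if __name__ == "__main__":
    # Toy keyword-based checker, purely illustrative.
    toy_check = lambda c, r: ("bullet" not in c) or r.strip().startswith("-")
    pair = build_rpo_pair(
        "Summarize the article.",
        ["Use bullet points.", "Keep it under 50 words."],
        chosen="The article argues that ...",
        rejected="- A rambling, off-topic list ...",
        check=toy_check,
    )
    print(pair.instruction)
```

Because violated constraints are reversed rather than filtered out, every chosen response is "perfect" by construction, which is what enlarges the chosen-rejected gap described in the abstract.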