

DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

August 20, 2025
作者: Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang
cs.AI

Abstract

We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and its restriction to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and the known information (e.g., reversing a math solution to recover a hidden variable), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward for optimizing the primal task, and it synergizes with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it improves average translation quality by 2.13 COMET across 756 directions, boosts mathematical reasoning accuracy by an average of 6.4 points on three challenging benchmarks, and improves performance by 9.3 points when used as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
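To make the mechanism concrete, here is a minimal Python sketch of DuPO-style reconstruction scoring. It assumes a generic `generate` callable standing in for the single LLM that instantiates both the primal and dual tasks; the function names, prompt wording, and toy overlap metric are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Tuple


def token_overlap(prediction: str, target: str) -> float:
    """Toy reconstruction metric in [0, 1]; a real system would use a
    task-specific check (e.g., exact value match for a hidden variable)."""
    pred_tokens = set(prediction.lower().split())
    target_tokens = set(target.lower().split())
    return len(pred_tokens & target_tokens) / max(len(target_tokens), 1)


def dupo_rank_candidates(
    known_input: str,            # visible part of the primal input
    hidden_input: str,           # withheld part to be reconstructed by the dual task
    candidates: List[str],       # sampled primal outputs (e.g., math solutions)
    generate: Callable[[str], str],  # one model serving both primal and dual tasks
) -> List[Tuple[str, float]]:
    """Score each primal candidate by how well the dual task reconstructs the
    withheld part of the input from that candidate plus the known information.
    The rewards can drive preference pairs or inference-time reranking."""
    scored = []
    for candidate in candidates:
        dual_prompt = (
            f"Known information: {known_input}\n"
            f"Primal output: {candidate}\n"
            "Reconstruct the withheld part of the original input:"
        )
        reconstruction = generate(dual_prompt)  # dual task via the same model
        scored.append((candidate, token_overlap(reconstruction, hidden_input)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Mock "LLM" so the sketch runs end to end without any model weights.
    def mock_generate(prompt: str) -> str:
        return "x = 5" if "y = 17" in prompt else "x = 3"

    ranked = dupo_rank_candidates(
        known_input="Problem: y = 3x + 2, with the value of x withheld; compute y.",
        hidden_input="x = 5",
        candidates=["y = 17", "y = 11"],
        generate=mock_generate,
    )
    print(ranked)  # the candidate whose dual reconstruction matches ranks first
```

In this sketch the self-supervised reward comes entirely from how well the dual pass recovers the hidden variable, so no reference labels are needed; the same scores could be turned into chosen/rejected pairs for preference optimization or used directly for reranking.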