DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

August 20, 2025
Authors: Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang
cs.AI

Abstract

We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: the reliance of Reinforcement Learning with Verifiable Rewards (RLVR) on costly labels and its restriction to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs a dual task that reconstructs the unknown part from the primal output and the known information (e.g., reversing a math solution to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward for optimizing the primal task, synergizing with LLMs' ability to instantiate both tasks within a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it improves average translation quality by 2.13 COMET across 756 translation directions, boosts mathematical reasoning accuracy by an average of 6.4 points on three challenging benchmarks, and improves performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
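
The dual-reconstruction reward described in the abstract can be made concrete with a short sketch. The snippet below is a minimal illustration, not the paper's implementation: `generate` is a hypothetical stand-in for a single LLM serving both the primal and dual tasks, and the prompt template and string-similarity metric are assumed placeholders for the paper's reconstruction-quality measure.

```python
# A minimal sketch of DuPO's dual-reconstruction reward, assuming a single
# LLM serves both the primal and dual tasks. `generate` is a hypothetical
# stand-in for that model; the prompt template and the string-similarity
# metric are illustrative placeholders, not the paper's exact recipe.

from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def dual_reward(
    generate: Callable[[str], str],
    known: str,          # known component of the primal input (stays visible)
    unknown: str,        # hidden component the dual task must reconstruct
    primal_output: str,  # candidate solution produced by the primal task
) -> float:
    """Score a primal output by how well the dual task recovers `unknown`.

    For math, `unknown` could be a hidden variable in the problem, and the
    dual prompt asks the model to infer it from the candidate solution plus
    the rest of the problem statement.
    """
    dual_prompt = (
        f"Partial problem: {known}\n"
        f"Proposed solution: {primal_output}\n"
        f"Reconstruct the missing part of the problem."
    )
    reconstruction = generate(dual_prompt)
    # Reconstruction quality acts as a self-supervised reward; plain string
    # similarity stands in for the paper's quality metric here.
    return SequenceMatcher(None, reconstruction.strip(), unknown.strip()).ratio()


def rerank(
    generate: Callable[[str], str],
    known: str,
    unknown: str,
    candidates: List[str],
) -> Tuple[str, float]:
    """Inference-time reranking: keep the candidate with the highest dual
    reward, trading extra computation for accuracy."""
    scored = [(c, dual_reward(generate, known, unknown, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])
```

In training, the same scores can rank sampled outputs into preference pairs for optimization; at inference, `rerank` realizes the compute-for-accuracy trade-off reported in the abstract.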