DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

Abstract

We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR) relies on costly labels and applies only to verifiable tasks, and traditional dual learning is restricted to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs a dual task that reconstructs the unknown part from the primal output and the known information (e.g., reversing math solutions to recover hidden variables), which broadens applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward for optimizing the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it improves average translation quality by 2.13 COMET over 756 directions, boosts mathematical reasoning accuracy by an average of 6.4 points on three challenging benchmarks, and improves performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
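
The reconstruction-as-reward idea described above can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the method's actual implementation: the function names (`dual_reward`, `rerank`, `toy_dual_solver`) are hypothetical, the dual task is a hand-written arithmetic inverse rather than an LLM call, and the binary exact-match reward is an assumption, since the abstract does not specify how reconstruction quality is scored.

```python
from typing import Callable, Sequence


def dual_reward(
    known: int,
    unknown: int,
    candidate: int,
    dual_solver: Callable[[int, int], int],
) -> float:
    """Score a primal output by how well the dual task recovers the hidden input.

    Assumption: an exact-match, binary reward. A real system could use a
    softer similarity measure instead.
    """
    reconstructed = dual_solver(candidate, known)
    return 1.0 if reconstructed == unknown else 0.0


def rerank(
    known: int,
    unknown: int,
    candidates: Sequence[int],
    dual_solver: Callable[[int, int], int],
) -> int:
    """Return the candidate with the highest dual-reconstruction reward."""
    return max(
        candidates,
        key=lambda c: dual_reward(known, unknown, c, dual_solver),
    )


def toy_dual_solver(answer: int, known: int) -> int:
    """Toy dual task: the primal task is `known + unknown`, so the dual
    recovers the hidden operand by subtracting the known one."""
    return answer - known


# Primal input split into a known part (7) and an unknown part (5).
known, unknown = 7, 5
# Hypothetical sampled primal outputs; only 12 is consistent with the input.
candidates = [11, 12, 13]
best = rerank(known, unknown, candidates, toy_dual_solver)
```

In this toy, the same reward could serve either as a training signal for preference optimization or, as in the `rerank` helper, as an inference-time selection criterion over sampled outputs.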
