RePO: ReLU-based Preference Optimization

Abstract

Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods such as RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter $\beta$, subsequent methods such as SimPO reintroduce complexity through two hyperparameters ($\beta$, $\gamma$). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances: (1) retaining SimPO's reference-free margins while removing $\beta$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters out trivial pairs. Theoretically, RePO is characterized as the limiting case of SimPO ($\beta \to \infty$), in which the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models while requiring only one hyperparameter to be tuned.
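The two ideas in the abstract, a ReLU max-margin loss on a reference-free margin and the $\beta \to \infty$ limit of a logistic weighting, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the margin is the difference of length-normalized sequence log-likelihoods (as in SimPO), that the remaining hyperparameter is the target margin $\gamma$, and that the SimPO-style objective is written as $-\log \sigma(\beta (M - \gamma))$ with $\gamma$ inside the $\beta$ scaling. All function names are hypothetical.

```python
import math


def length_normalized_margin(logp_chosen, logp_rejected, len_chosen, len_rejected):
    """Reference-free margin M: difference of average per-token log-likelihoods.

    Assumption of this sketch: sequence log-probabilities are normalized by
    response length, following the SimPO-style margin the abstract refers to.
    """
    return logp_chosen / len_chosen - logp_rejected / len_rejected


def repo_loss(margin, gamma):
    """ReLU max-margin loss, ReLU(gamma - M).

    The loss is exactly zero once the margin reaches the target gamma, so
    pairs that are already well separated contribute nothing.
    """
    return max(0.0, gamma - margin)


def repo_grad_weight(margin, gamma):
    """Magnitude of d(loss)/dM for the ReLU loss: a binary threshold on M."""
    return 1.0 if margin < gamma else 0.0


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simpo_style_grad_weight(margin, gamma, beta):
    """Logistic gradient weight sigma(-beta * (M - gamma)), per unit beta.

    Assumption: the objective is -log(sigmoid(beta * (M - gamma))). As beta
    grows, this weight approaches the binary threshold in repo_grad_weight.
    """
    return _sigmoid(-beta * (margin - gamma))


if __name__ == "__main__":
    gamma = 1.0
    for m in (0.4, 1.3):
        print(
            m,
            repo_loss(m, gamma),
            repo_grad_weight(m, gamma),
            simpo_style_grad_weight(m, gamma, beta=2.0),
            simpo_style_grad_weight(m, gamma, beta=1000.0),
        )
```

With a moderate $\beta$ the logistic weight is a smooth function of the margin, whereas for very large $\beta$ it is numerically indistinguishable from the 0/1 threshold that the ReLU loss uses directly. That is the sense in which the abstract describes RePO as a limiting case of SimPO.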
