Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

Abstract

Self-Rewarding Language Models propose an architecture in which a large language model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models, which strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces (1) Anchored Rejection, which fixes rejected responses using the past initial model's outputs, and (2) Future-Guided Chosen, which dynamically curates chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) demonstrate significant improvements when models are trained with our method compared to Self-Rewarding using the same computational resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75 points. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
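To make the two phases concrete, the following is a minimal sketch of how preference pairs might be assembled under this framing. It is an illustration under stated assumptions, not the procedure used in the experiments: the names `PreferencePair`, `anchored_rejection_pair`, and `future_guided_pair` are hypothetical, the models and the judge are stand-in callables, and the selection rules (lowest-scored past output as rejected, highest-scored candidate from a pooled current and future set as chosen) are assumptions that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins: a "model" maps a prompt to candidate responses,
# and a "judge" maps (prompt, response) to a scalar score.
Generator = Callable[[str], List[str]]
Judge = Callable[[str, str], float]


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def _best(prompt: str, responses: List[str], judge: Judge) -> str:
    """Return the response the judge scores highest."""
    return max(responses, key=lambda r: judge(prompt, r))


def _worst(prompt: str, responses: List[str], judge: Judge) -> str:
    """Return the response the judge scores lowest."""
    return min(responses, key=lambda r: judge(prompt, r))


def anchored_rejection_pair(
    prompt: str,
    past_model: Generator,
    current_model: Generator,
    judge: Judge,
) -> PreferencePair:
    """Anchored Rejection (sketch): the rejected response always comes from
    the past initial model, so it does not improve across iterations.
    Assumption: the chosen response is the current model's best candidate."""
    rejected = _worst(prompt, past_model(prompt), judge)
    chosen = _best(prompt, current_model(prompt), judge)
    return PreferencePair(prompt, chosen, rejected)


def future_guided_pair(
    prompt: str,
    past_model: Generator,
    current_model: Generator,
    future_model: Generator,
    judge: Judge,
) -> PreferencePair:
    """Future-Guided Chosen (sketch): the chosen response is curated with the
    help of a next-generation model's outputs.
    Assumption: current and future candidates are pooled and the judge
    picks the best of the pool; the rejected side stays anchored."""
    rejected = _worst(prompt, past_model(prompt), judge)
    candidates = current_model(prompt) + future_model(prompt)
    chosen = _best(prompt, candidates, judge)
    return PreferencePair(prompt, chosen, rejected)
```

The resulting pairs would then be passed to a DPO training step. Because the rejected side is drawn from a fixed past model while the chosen side tracks newer models, the quality gap between the two sides is not forced to shrink as training proceeds, which is the property the abstract argues for.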
