CroCo: Cross-lingual Contrastive Preference Tuning on Self-Generations

Abstract

Prior work establishes that controlled contrastiveness between a large language model's self-generated responses, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models on a diverse set of tasks across a total of 14 high- and low-resource languages. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings for most languages, and pairing in either a monolingual or a multilingual setting improves on each base model in the majority of setups while avoiding the catastrophic forgetting seen with supervised fine-tuning. We observe that the gains depend on on-policy data: off-policy responses reduce the benefit, and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base model in 6/7 languages for EuroLLM-9B and in 4/7 settings for Aya-3B. On open-ended generation, both tuned models outperform their respective base models across the 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.
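To make "controlled contrastiveness set via reward scores" concrete, the following is a minimal sketch of one way a preference pair could be built from a model's own samples. The abstract does not specify the pairing rule, so everything here is an assumption for illustration: the `Scored` type, the `build_pair` helper, the `target_gap` parameter, and the reward values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    text: str
    reward: float  # score from a reward model (hypothetical values below)


def build_pair(candidates, target_gap):
    """Return (chosen, rejected) from one prompt's self-generated candidates.

    Assumed rule (not taken from the paper): the chosen response is the one
    with the highest reward; the rejected response is the one whose reward
    gap to the chosen response is closest to target_gap.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")
    ranked = sorted(candidates, key=lambda c: c.reward, reverse=True)
    chosen, rest = ranked[0], ranked[1:]
    rejected = min(
        rest, key=lambda c: abs((chosen.reward - c.reward) - target_gap)
    )
    return chosen, rejected


# Three on-policy samples for a single prompt, already scored.
samples = [
    Scored("response A", 0.9),
    Scored("response B", 0.7),
    Scored("response C", 0.2),
]
chosen, rejected = build_pair(samples, target_gap=0.6)
print(chosen.text, "|", rejected.text)
```

Under this assumed rule, a larger `target_gap` yields more strongly contrasting pairs and a smaller one yields closer pairs; the resulting (chosen, rejected) pairs would then feed an offline preference-optimization step.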