Configurable Preference Tuning with Rubric-Guided Synthetic Data

Abstract

Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a single, static set of preferences, which limits their adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework that lets language models dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes such as writing style. By fine-tuning on these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced, context-dependent human feedback. Several experimental artifacts, such as training code, generated datasets, and fine-tuned models, are released at https://github.com/vicgalle/configurable-preference-tuning.
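
To make the idea of system-prompt-conditioned preference data concrete, the following is a minimal sketch of one plausible way to lay out such records. It is not taken from the released code: the `PreferenceRecord` fields, the `system_prompt_for` wording, and the choice to swap chosen and rejected responses between a "high" and a "low" rubric level are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceRecord:
    """One DPO-style preference example (field names are hypothetical)."""
    system: str    # directive derived from a rubric level
    prompt: str    # user request
    chosen: str    # response preferred under this directive
    rejected: str  # response dispreferred under this directive


def system_prompt_for(attribute: str, level: str) -> str:
    """Turn a rubric level into a short directive (assumed wording)."""
    return f"Write with a {level} degree of {attribute}."


def build_pairs(attribute: str, prompt: str,
                low_response: str, high_response: str) -> list[PreferenceRecord]:
    """Build two records from one pair of responses.

    The same two responses are used twice; only the system prompt decides
    which one counts as preferred. This swap is an assumption about how
    preferences could be made configurable, not a confirmed detail.
    """
    return [
        PreferenceRecord(system_prompt_for(attribute, "high"),
                         prompt, chosen=high_response, rejected=low_response),
        PreferenceRecord(system_prompt_for(attribute, "low"),
                         prompt, chosen=low_response, rejected=high_response),
    ]


if __name__ == "__main__":
    for record in build_pairs(
        attribute="formality",
        prompt="Describe the weather today.",
        low_response="it's kinda rainy out",
        high_response="Light rain is expected throughout the afternoon.",
    ):
        print(record)
```

Under this layout, a model fine-tuned on both records cannot learn a fixed preference for either response; it has to attend to the system prompt, which is the behavior the abstract describes.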
