

Configurable Preference Tuning with Rubric-Guided Synthetic Data

June 13, 2025
Author: Víctor Gallego
cs.AI

Abstract

Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, including training code, generated datasets, and fine-tuned models, are released at https://github.com/vicgalle/configurable-preference-tuning.