

Configurable Preference Tuning with Rubric-Guided Synthetic Data

June 13, 2025
Author: Víctor Gallego
cs.AI

Abstract

Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, including training code, generated datasets, and fine-tuned models, are released at https://github.com/vicgalle/configurable-preference-tuning.