Configurable Preference Tuning with Rubric-Guided Synthetic Data

June 13, 2025
Author: Víctor Gallego
cs.AI

Abstract

Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes such as writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, including training code, generated datasets, and fine-tuned models, are released at https://github.com/vicgalle/configurable-preference-tuning.
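To make the data-construction idea concrete, below is a minimal Python sketch (not the authors' released code) of how rubric-guided preference pairs might be assembled for DPO-style fine-tuning. The rubric text, system-prompt template, and helper names are illustrative assumptions; the actual pipeline in the linked repository may differ.

```python
# Minimal sketch: building rubric-conditioned preference pairs for DPO-style
# fine-tuning. All rubric content and helper names here are hypothetical.

from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # rubric-derived system directive + user prompt
    chosen: str    # completion that satisfies the requested rubric level
    rejected: str  # completion that violates it


# A structured, fine-grained rubric for one attribute (writing style).
RUBRIC = {
    "attribute": "writing style",
    "levels": {
        "high": "Use vivid, figurative language with varied sentence rhythm.",
        "low": "Use plain, literal language with short declarative sentences.",
    },
}


def make_system_prompt(attribute: str, directive: str) -> str:
    """Turn a rubric level into an explicit, human-interpretable directive."""
    return f"When writing, control the {attribute} as follows: {directive}"


def build_pair(user_prompt: str, on_spec: str, off_spec: str, level: str) -> PreferencePair:
    """Condition the prompt on a rubric level and pair on-spec vs. off-spec completions."""
    system = make_system_prompt(RUBRIC["attribute"], RUBRIC["levels"][level])
    return PreferencePair(
        prompt=f"{system}\n\nUser: {user_prompt}",
        chosen=on_spec,
        rejected=off_spec,
    )


# The same user prompt yields different (chosen, rejected) pairs depending on
# which rubric level conditions the system prompt, so the fine-tuned model
# learns to follow the directive at inference time rather than a fixed style.
pair = build_pair(
    user_prompt="Describe a rainy evening in the city.",
    on_spec="Neon bled into the puddles while the avenue hummed under its wet breath...",
    off_spec="It rained in the evening. The streets were wet. People went home.",
    level="high",
)
print(pair.prompt)
```

Pairs in this (prompt, chosen, rejected) format match the inputs expected by standard preference-optimization trainers, so the rubric-guided dataset can plug into an off-the-shelf DPO training loop without modification.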