Modifying Large Language Model Post-Training for Diverse Creative Writing
March 21, 2025
Authors: John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yuqian Sun, Max Kreminski
cs.AI
Abstract
As creative writing tasks do not have singular correct answers, large
language models (LLMs) trained to perform these tasks should be able to
generate diverse valid outputs. However, LLM post-training often focuses on
improving generation quality but neglects to facilitate output diversity.
Hence, in creative writing generation, we investigate post-training approaches
to promote both output diversity and quality. Our core idea is to include
deviation -- the degree of difference between a training sample and all other
samples with the same prompt -- in the training objective to facilitate
learning from rare high-quality instances. By applying our approach to direct
preference optimization (DPO) and odds ratio preference optimization (ORPO), we
demonstrate that we can promote the output diversity of trained models while
minimally decreasing quality. Our best 8B-parameter model achieves diversity
on par with a human-created dataset while maintaining output quality similar
to that of the best instruction-tuned models we examined, GPT-4o and DeepSeek-R1. We
further validate our approaches with a human evaluation, an ablation, and a
comparison to an existing diversification approach, DivPO.
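To make the core idea more concrete, below is a minimal sketch (not the authors' implementation) of how a per-sample deviation score could be folded into a DPO-style objective as a weight on each preference pair. The function names `pairwise_deviation` and `deviation_weighted_dpo_loss`, the cosine-distance deviation measure, and the mean-normalized weighting are illustrative assumptions; the paper's exact formulation may differ.

```python
# Sketch only: deviation-weighted DPO loss under assumed definitions.
import torch
import torch.nn.functional as F


def pairwise_deviation(embeddings: torch.Tensor) -> torch.Tensor:
    """Deviation of each response = mean cosine distance to the other
    responses generated for the same prompt. `embeddings` is (N, D)."""
    normed = F.normalize(embeddings, dim=-1)
    sims = normed @ normed.T                       # (N, N) cosine similarities
    n = sims.size(0)
    mean_sim_to_others = (sims.sum(dim=1) - sims.diag()) / max(n - 1, 1)
    return 1.0 - mean_sim_to_others                # higher = more different


def deviation_weighted_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_l | x), shape (B,)
    deviation: torch.Tensor,              # deviation of each chosen response, shape (B,)
    beta: float = 0.1,
) -> torch.Tensor:
    # Standard DPO logits: beta * (chosen log-ratio minus rejected log-ratio).
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)

    # Hypothetical weighting: scale each pair's loss by the deviation of its
    # chosen response, so rare high-quality samples contribute more.
    weights = deviation / deviation.mean().clamp_min(1e-8)
    return (-F.logsigmoid(logits) * weights).mean()
```

In such a setup, deviation scores would be precomputed once per prompt group (e.g., from response embeddings) and passed alongside the preference pairs; an analogous weighting could be attached to the odds-ratio term when using ORPO instead of DPO.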