

Modifying Large Language Model Post-Training for Diverse Creative Writing

March 21, 2025
Authors: John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yuqian Sun, Max Kreminski
cs.AI

Abstract

As creative writing tasks do not have singular correct answers, large language models (LLMs) trained to perform these tasks should be able to generate diverse valid outputs. However, LLM post-training often focuses on improving generation quality while neglecting to facilitate output diversity. Hence, we investigate post-training approaches for creative writing generation that promote both output diversity and quality. Our core idea is to include deviation -- the degree of difference between a training sample and all other samples with the same prompt -- in the training objective to facilitate learning from rare high-quality instances. By applying our approach to direct preference optimization (DPO) and odds ratio preference optimization (ORPO), we demonstrate that we can promote the output diversity of trained models while minimally decreasing quality. Our best model, with 8B parameters, achieves diversity on par with a human-created dataset while maintaining output quality similar to the best instruction-tuned models we examined, GPT-4o and DeepSeek-R1. We further validate our approaches with a human evaluation, an ablation study, and a comparison to an existing diversification approach, DivPO.
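The core idea above, weighting training examples by how much they deviate from other samples for the same prompt, can be illustrated with a minimal sketch. The sketch below assumes deviation is computed as the mean pairwise cosine distance between sample embeddings and is then used as a per-sample weight on a standard DPO loss; the function names (`pairwise_deviation`, `deviation_weighted_dpo_loss`) and the exact weighting scheme are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F


def pairwise_deviation(embeddings: torch.Tensor) -> torch.Tensor:
    """Mean cosine distance of each sample to all other samples
    generated for the same prompt (higher = more distinctive).

    embeddings: (n_samples, dim) tensor of sample embeddings.
    """
    normed = F.normalize(embeddings, dim=-1)
    sim = normed @ normed.T                          # (n, n) cosine similarities
    n = sim.size(0)
    # Average similarity to the other samples, excluding self-similarity.
    mean_sim = (sim.sum(dim=1) - sim.diagonal()) / (n - 1)
    return 1.0 - mean_sim                            # deviation score per sample


def deviation_weighted_dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    chosen_deviation: torch.Tensor,  # deviation score of each chosen sample
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO loss, re-weighted so that rare, high-deviation
    chosen samples contribute more to the gradient (an assumed scheme)."""
    logits = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    per_sample_loss = -F.logsigmoid(logits)
    # Normalize weights by their mean so the overall loss scale stays
    # comparable to unweighted DPO.
    weights = chosen_deviation / chosen_deviation.mean().clamp_min(1e-8)
    return (weights * per_sample_loss).mean()
```

Normalizing the deviation weights by their mean is one way to keep the average loss magnitude close to that of plain DPO, so the weighting mainly redistributes gradient mass toward more distinctive high-quality samples rather than changing the effective learning rate; the same weighting idea could be applied analogously to an ORPO objective.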