

Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

May 25, 2025
Authors: Qian Cao, Xiting Wang, Yuzhuo Yuan, Yahui Liu, Fang Luo, Ruihua Song
cs.AI

Abstract

Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.
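To make the pairwise-comparison idea in the abstract concrete, below is a minimal, hypothetical sketch of how two responses to the same shared instruction might be compared by an LLM judge. The function names, prompt wording, and the `llm` callable are illustrative assumptions; they do not reflect CrEval's actual prompt format, judging criteria, or training setup.

```python
# Hypothetical sketch of pairwise creativity judging with a shared instruction.
# build_prompt / judge_pair and the prompt text are illustrative, not CrEval's
# actual interface; `llm` is any callable mapping a prompt string to a reply.

def build_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Place both responses under one shared instruction so the judge
    compares them in the same context."""
    return (
        "You are judging the creativity of two responses to the same instruction.\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is more creative? Answer with 'A', 'B', or 'Tie'."
    )


def judge_pair(llm, instruction: str, response_a: str, response_b: str) -> str:
    """Query an evaluator model and normalise its verdict to 'A', 'B', or 'Tie'."""
    verdict = llm(build_prompt(instruction, response_a, response_b)).strip()
    return verdict if verdict in {"A", "B", "Tie"} else "Tie"
```

In such a setup, keeping the instruction shared between the two responses is what anchors the comparison, which is the evaluation-consistency benefit the abstract attributes to the framework.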