Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator
May 25, 2025
Authors: Qian Cao, Xiting Wang, Yuzhuo Yuan, Yahui Liu, Fang Luo, Ruihua Song
cs.AI
Abstract
Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.
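As a rough illustration of the pairwise-comparison setup described in the abstract, the sketch below shows how a single shared instruction and two candidate responses might be packaged into one judging prompt for an LLM-based evaluator. This is a minimal sketch under assumptions: the prompt wording, the `CreativePair` data class, and the `llm_complete` callable are hypothetical placeholders and do not reflect the actual CrEval interface or data format.

```python
# Illustrative sketch only: the exact prompt format and model interface used by
# CrEval are not specified in the abstract; the names below are hypothetical.
from dataclasses import dataclass


@dataclass
class CreativePair:
    instruction: str   # shared contextual instruction (task/context)
    response_a: str    # candidate response A
    response_b: str    # candidate response B


def build_pairwise_prompt(pair: CreativePair) -> str:
    """Compose a pairwise-comparison prompt: both responses are judged under
    the same shared instruction, so creativity is compared in a common context
    rather than scored for each response in isolation."""
    return (
        "Instruction:\n"
        f"{pair.instruction}\n\n"
        "Response A:\n"
        f"{pair.response_a}\n\n"
        "Response B:\n"
        f"{pair.response_b}\n\n"
        "Which response is more creative given the instruction? "
        "Answer with 'A' or 'B'."
    )


def judge_creativity(pair: CreativePair, llm_complete) -> str:
    """llm_complete is any callable mapping a prompt string to a completion,
    e.g. a wrapper around a local or hosted LLM evaluator."""
    answer = llm_complete(build_pairwise_prompt(pair)).strip().upper()
    return "A" if answer.startswith("A") else "B"


if __name__ == "__main__":
    pair = CreativePair(
        instruction="Write a one-sentence slogan for a library that never closes.",
        response_a="Open all day, every day.",
        response_b="Where midnight and Monday read the same page.",
    )
    # Plug in any LLM backend here; a trivial stub is used for illustration.
    print(judge_creativity(pair, llm_complete=lambda prompt: "B"))
```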