Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

Abstract

Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations rely heavily on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. By training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval substantially outperforms existing methods in alignment with human judgments. Experimental results show that integrating both human-generated and synthetic data is essential for training highly robust evaluators, and demonstrate the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.
