Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

Abstract

Evaluating creativity remains a challenging frontier for large language models (LLMs). Current evaluations rely heavily on human judgments, which are inefficient and costly, and this hinders progress in enhancing machine creativity. Automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, but they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that leverages shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. By training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval substantially outperforms existing methods in alignment with human judgments. Experimental results underscore the importance of integrating both human-generated and synthetic data when training highly robust evaluators, and demonstrate the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.
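To make the pairwise-comparison setup concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. Two responses to the same instruction are compared, and the judge's choice is scored against a human preference. The names `Pair`, `toy_judge`, and `agreement` are hypothetical, the example pairs are invented for illustration, and `toy_judge` is a trivial stand-in heuristic rather than CrEval.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pair:
    """Two responses to one shared instruction, plus a human preference."""
    instruction: str   # shared context for both responses
    response_a: str
    response_b: str
    human_choice: str  # "A" or "B": the response a human judged more creative


def toy_judge(instruction: str, response_a: str, response_b: str) -> str:
    """Placeholder judge (an assumption, NOT CrEval): prefers the response
    with more distinct words. A real evaluator would be a trained LLM."""
    def distinct(text: str) -> int:
        return len(set(text.lower().split()))
    return "A" if distinct(response_a) >= distinct(response_b) else "B"


def agreement(pairs: List[Pair], judge: Callable[[str, str, str], str]) -> float:
    """Fraction of pairs on which the judge picks the human-preferred response."""
    hits = sum(
        judge(p.instruction, p.response_a, p.response_b) == p.human_choice
        for p in pairs
    )
    return hits / len(pairs)


# Invented illustrative pairs; not drawn from CreataSet.
pairs = [
    Pair("Describe rain.",
         "Rain is water falling.",
         "The sky rehearses its applause on every rooftop.",
         "B"),
    Pair("Describe a clock.",
         "A patient drummer counting the day's heartbeat.",
         "A clock shows time.",
         "A"),
]

rate = agreement(pairs, toy_judge)
print(f"agreement with human preference: {rate:.2f}")
```

Swapping `toy_judge` for a call to any model-based evaluator with the same three-argument signature leaves `agreement` unchanged, which is the point of fixing the instruction and comparing responses in pairs.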
