DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

December 22, 2025
Authors: Shijian Ma, Yunqi Huang, Yan Lin
cs.AI

Abstract

Drama script continuation requires models to maintain character consistency, advance the plot coherently, and preserve dramatic structure, capabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rule-based analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct a comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations in total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3 of 5 dimensions). Our ablation studies confirm that all six dimensions capture independent quality aspects (mean |r| = 0.020). DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.
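The abstract does not state which correlation coefficient underlies the independence figure (mean |r| = 0.020) or how it is aggregated. As a minimal sketch, assuming it is the mean absolute Pearson correlation over all pairs of dimension scores, it could be computed as below. The function names and the toy scores are hypothetical and are not taken from the paper; only the counts of 8 models and 1,103 scripts come from the abstract.

```python
import math
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length score lists (non-zero variance assumed)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_abs_correlation(scores):
    """scores: dict mapping a dimension name to its list of per-script scores.

    Returns the mean of |r| over every unordered pair of dimensions.
    This aggregation is an assumption, not the paper's confirmed method.
    """
    pairs = combinations(scores, 2)
    return mean(abs(pearson(scores[a], scores[b])) for a, b in pairs)


# Bookkeeping stated in the abstract: 8 models, each run on 1,103 scripts.
n_models, n_scripts = 8, 1103
total_evaluations = n_models * n_scripts

# Six dimensions give 15 unordered dimension pairs to average over.
dimensions = [
    "Format Standards", "Narrative Efficiency", "Character Consistency",
    "Emotional Depth", "Logic Consistency", "Conflict Handling",
]
n_dimension_pairs = len(list(combinations(dimensions, 2)))

# Toy scores (invented for illustration only): two perfectly anti-correlated dimensions.
toy_scores = {"dim_a": [1.0, 2.0, 3.0, 4.0], "dim_b": [4.0, 3.0, 2.0, 1.0]}
toy_mean_abs_r = mean_abs_correlation(toy_scores)
```

Under this reading, a value near 0 means that a model's score on one dimension says little about its score on another, which is what the independence claim requires.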