DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

December 22, 2025
Authors: Shijian Ma, Yunqi Huang, Yan Lin
cs.AI

Abstract

Drama script continuation requires models to maintain character consistency, advance the plot coherently, and preserve dramatic structure, capabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rule-based analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct a comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm that all six dimensions capture independent quality aspects (mean |r| = 0.020). DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.
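
To make the independence statistic concrete, the following is a minimal sketch of how a mean absolute pairwise Pearson correlation across six dimensions could be computed. The abstract does not describe the authors' exact procedure, so the function names, the dimension keys, and the layout of the score dictionary are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

# Hypothetical keys for the six dimensions named in the abstract.
DIMENSIONS = [
    "format_standards", "narrative_efficiency", "character_consistency",
    "emotional_depth", "logic_consistency", "conflict_handling",
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def mean_abs_pairwise_r(scores):
    """Average |r| over all unordered pairs of dimensions.

    `scores` maps a dimension name to a list of per-script scores,
    with every list in the same script order (an assumed layout).
    """
    pairs = list(combinations(scores, 2))
    total = sum(abs(pearson_r(scores[a], scores[b])) for a, b in pairs)
    return total / len(pairs)


# Counts stated in the abstract: 8 models on 1,103 scripts.
total_evaluations = 8 * 1103
# Six dimensions give 15 unordered dimension pairs.
dimension_pairs = len(list(combinations(DIMENSIONS, 2)))
```

A value near zero, such as the reported 0.020, would indicate that a script's score on one dimension says little about its score on another, which is what the independence claim requires.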