
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

October 25, 2025
Authors: Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang, Zhaoyang Yu, Jinlin Wang, Sirui Hong, Bang Liu, Chenglin Wu, Yuyu Luo
cs.AI

Abstract
Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.551 and a correlation with human ratings of only 0.429. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.442 (a 19.8% reduction) and increasing the consistency with human experts to 0.681 (a 58.7% improvement) compared to GPT-5. The benchmark is available at https://github.com/HKUSTDial/VisJudgeBench.
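The abstract compares models to human raters using two standard metrics: Mean Absolute Error (MAE) and correlation with human ratings. As a minimal illustrative sketch (plain Python, no external libraries; the score lists below are made-up values on an assumed 1-5 quality scale, not data from the benchmark), the two metrics can be computed as follows, assuming Pearson's r as the correlation measure:

```python
from math import sqrt

def mean_absolute_error(preds, targets):
    """Average absolute difference between predicted and human scores."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def pearson_correlation(xs, ys):
    """Pearson's r: degree of linear agreement between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ratings for five visualizations (illustration only).
human = [4.0, 3.5, 2.0, 4.5, 3.0]
model = [3.6, 3.8, 2.5, 4.0, 3.4]
print(round(mean_absolute_error(model, human), 3))   # lower is better
print(round(pearson_correlation(model, human), 3))   # closer to 1 is better
```

Lower MAE means the model's scores land closer to the experts' numbers; higher correlation means the model ranks visualizations more like the experts do, which is why VisJudge's drop in MAE (0.551 to 0.442) and rise in consistency (0.429 to 0.681) both indicate a narrower gap to human judgment.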