Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks

Abstract

AI models are increasingly used for data analysis and visualization, yet benchmarks rarely address scatterplot-specific tasks, which limits insight into how well models perform on them. To address this gap for one of the most common chart types, we introduce a synthetic, annotated dataset of over 18,000 scatterplots from six data generators and 17 chart designs, and a benchmark based on it. We evaluate proprietary models from OpenAI and Google using N-shot prompting on five distinct tasks derived from annotations of cluster bounding boxes, their center coordinates, and outlier coordinates. OpenAI models and Gemini 2.5 Flash, especially when prompted with examples, are viable options for counting clusters and, in Flash's case, for counting outliers (over 90% Accuracy). However, the results for localization-related tasks are unsatisfactory: Precision and Recall are near or below 50%, except for Flash in outlier identification (65.01%). Furthermore, chart design appears to be a secondary factor in performance, but it is advisable to avoid scatterplots with wide aspect ratios (16:9 and 21:9) or with randomly assigned colors. Supplementary materials are available at https://github.com/feedzai/biy-paper.