Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks

Abstract

AI models are increasingly used for data analysis and visualization, yet benchmarks rarely address scatterplot-specific tasks, which limits insight into model performance on them. To address this gap for one of the most common chart types, we introduce a synthetic, annotated dataset of over 18,000 scatterplots from six data generators and 17 chart designs, and a benchmark based on it. We evaluate proprietary models from OpenAI and Google using N-shot prompting on five distinct tasks derived from annotations of cluster bounding boxes, their center coordinates, and outlier coordinates. OpenAI models and Gemini 2.5 Flash, especially when prompted with examples, are viable options for counting clusters and, in Flash's case, outliers (90%+ Accuracy). However, the results for localization-related tasks are unsatisfactory: Precision and Recall are near or below 50%, except for Flash in outlier identification (65.01%). Furthermore, chart design appears to have only a secondary impact on performance, but it is advisable to avoid scatterplots with wide aspect ratios (16:9 and 21:9) or randomly assigned colors. Supplementary materials are available at https://github.com/feedzai/biy-paper.
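To make the localization metrics concrete, the following is a minimal sketch of how Precision and Recall might be computed for predicted outlier coordinates. It is illustrative only and not the benchmark's actual scoring code: the function name `precision_recall`, the greedy nearest-point matching, the `tolerance` value, and the use of coordinates normalized to [0, 1] are all assumptions.

```python
from math import hypot


def precision_recall(predicted, actual, tolerance=0.05):
    """Score predicted points against ground-truth points.

    Hypothetical rule (an assumption, not taken from the paper): a prediction
    counts as a true positive if it lies within `tolerance` of the nearest
    ground-truth point that has not been matched yet.
    """
    unmatched = list(actual)
    true_positives = 0
    for px, py in predicted:
        # Nearest ground-truth point that is still available for matching.
        best = min(
            unmatched,
            key=lambda p: hypot(p[0] - px, p[1] - py),
            default=None,
        )
        if best is not None and hypot(best[0] - px, best[1] - py) <= tolerance:
            unmatched.remove(best)  # enforce one-to-one matching
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up example data: two annotated outliers, three model predictions.
    actual = [(0.10, 0.90), (0.80, 0.20)]
    predicted = [(0.11, 0.88), (0.50, 0.50), (0.79, 0.21)]
    print(precision_recall(predicted, actual))
```

In this made-up example, two of the three predictions fall within the tolerance of a distinct annotated outlier, so Precision is 2/3 and Recall is 1.0. A stricter or looser tolerance changes both numbers, which is why the matching rule should be stated whenever such scores are reported.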