SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

Abstract

The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets makes it difficult to build robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution that increases dataset diversity and improves the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Dataset Quality Metric (SDQM), which assesses data quality for object detection tasks without requiring model training to converge. The metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, whereas previous metrics exhibited only moderate or weak correlations. SDQM also provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM.
