SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation
October 8, 2025
Authors: Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin
cs.AI
Abstract
The performance of machine learning models depends heavily on training data.
The scarcity of large-scale, well-annotated datasets poses significant
challenges in creating robust models. To address this, synthetic data generated
through simulations and generative models has emerged as a promising solution,
enhancing dataset diversity and improving the performance, reliability, and
resilience of models. However, evaluating the quality of this generated data
requires an effective metric. This paper introduces the Synthetic Dataset
Quality Metric (SDQM) to assess data quality for object detection tasks without
requiring model training to converge. This metric enables more efficient
generation and selection of synthetic datasets, addressing a key challenge in
resource-constrained object detection tasks. In our experiments, SDQM
demonstrated a strong correlation with the mean Average Precision (mAP) scores
of YOLOv11, a leading object detection model, while previous metrics only
exhibited moderate or weak correlations. Additionally, it provides actionable
insights for improving dataset quality, minimizing the need for costly
iterative training. This scalable and efficient metric sets a new standard for
evaluating synthetic data. The code for SDQM is available at
https://github.com/ayushzenith/SDQM
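
The correlation claim above can be illustrated with a minimal sketch. The abstract does not say which correlation coefficient was used, so Pearson's r is assumed here. The function name `pearson_r` and both score lists are hypothetical placeholders for illustration only; they are not values or code from the paper or the SDQM repository.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder data: one quality-metric score and one mAP score
# per candidate synthetic dataset. These are NOT results from the paper.
metric_scores = [0.10, 0.35, 0.50, 0.70, 0.90]
map_scores = [0.22, 0.31, 0.48, 0.55, 0.71]

r = pearson_r(metric_scores, map_scores)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 would mean the metric ranks datasets much as the detector's mAP does, which is what would let it stand in for training each candidate to convergence. A rank-based coefficient such as Spearman's could be substituted if only the ordering of datasets matters.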