SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

Abstract

The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, while previous metrics only exhibited moderate or weak correlations. Additionally, it provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM
