An accurate detection is not all you need to combat label noise in web-noisy datasets

July 8, 2024
Authors: Paul Albert, Jack Valmadre, Eric Arazo, Tarun Krishna, Noel E. O'Connor, Kevin McGuinness
cs.AI

Abstract

Training a classifier on web-crawled data demands learning algorithms that are robust to annotation errors and irrelevant examples. This paper builds upon the recent empirical observation that applying unsupervised contrastive learning to noisy, web-crawled datasets yields a feature representation under which the in-distribution (ID) and out-of-distribution (OOD) samples are linearly separable. We show that direct estimation of the separating hyperplane can indeed offer an accurate detection of OOD samples, and yet, surprisingly, this detection does not translate into gains in classification accuracy. Digging deeper into this phenomenon, we discover that the near-perfect detection misses a type of clean example that is valuable for supervised learning. These examples often represent visually simple images, which are relatively easy to identify as clean using standard loss- or distance-based methods despite being poorly separated from the OOD distribution using unsupervised learning. Because we further observe a low correlation with state-of-the-art (SOTA) metrics, we propose a hybrid solution that alternates between noise detection using linear separation and a SOTA small-loss approach. When combined with the SOTA algorithm PLS, we substantially improve SOTA results for real-world image classification in the presence of web noise. Repository: github.com/PaulAlbert31/LSA
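
The alternation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository for that); it only shows the general idea of switching between a linear-separation test and a small-loss test when choosing which samples to treat as clean. The function names, the even/odd epoch schedule, the `keep_ratio` parameter, and the assumption that a hyperplane `(w, b)` has already been estimated are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def linear_separation_mask(features: Sequence[Sequence[float]],
                           w: Sequence[float], b: float) -> List[bool]:
    """Mark a sample as in-distribution when it lies on the positive
    side of the hyperplane w.x + b = 0 (hyperplane assumed given)."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b > 0.0 for x in features]


def small_loss_mask(losses: Sequence[float], keep_ratio: float) -> List[bool]:
    """Mark the keep_ratio fraction of samples with the smallest loss
    as clean (a generic small-loss heuristic, not the PLS algorithm)."""
    k = int(round(keep_ratio * len(losses)))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    keep = set(order[:k])
    return [i in keep for i in range(len(losses))]


def select_clean(epoch: int,
                 features: Sequence[Sequence[float]],
                 losses: Sequence[float],
                 w: Sequence[float], b: float,
                 keep_ratio: float = 0.5) -> List[bool]:
    """Alternate detectors: linear separation on even epochs,
    small loss on odd epochs (the schedule is an assumption)."""
    if epoch % 2 == 0:
        return linear_separation_mask(features, w, b)
    return small_loss_mask(losses, keep_ratio)


# Toy data: four samples with 2-D features and per-sample losses.
features = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5], [-0.2, -0.1]]
losses = [0.9, 0.1, 0.2, 0.8]
w, b = [1.0, 0.0], 0.0

even_mask = select_clean(0, features, losses, w, b)  # linear separation
odd_mask = select_clean(1, features, losses, w, b)   # small loss
```

In this toy example the two detectors disagree on some samples, which mirrors the abstract's observation that the linear-separation detector and loss-based metrics are only weakly correlated and can therefore complement each other.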

