
An accurate detection is not all you need to combat label noise in web-noisy datasets

July 8, 2024
Authors: Paul Albert, Jack Valmadre, Eric Arazo, Tarun Krishna, Noel E. O'Connor, Kevin McGuinness
cs.AI

Abstract

Training a classifier on web-crawled data demands learning algorithms that are robust to annotation errors and irrelevant examples. This paper builds upon the recent empirical observation that applying unsupervised contrastive learning to noisy, web-crawled datasets yields a feature representation under which the in-distribution (ID) and out-of-distribution (OOD) samples are linearly separable. We show that direct estimation of the separating hyperplane can indeed detect OOD samples accurately, and yet, surprisingly, this detection does not translate into gains in classification accuracy. Digging deeper into this phenomenon, we discover that the near-perfect detection misses a type of clean example that is valuable for supervised learning. These examples are often visually simple images, which are relatively easy to identify as clean using standard loss- or distance-based methods despite being poorly separated from the OOD distribution in the representation learned without supervision. Because we further observe a low correlation with state-of-the-art (SOTA) metrics, we propose a hybrid solution that alternates between noise detection using linear separation and a SOTA small-loss approach. When combined with the SOTA algorithm PLS, we substantially improve SOTA results for real-world image classification in the presence of web noise. GitHub: github.com/PaulAlbert31/LSA
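
The alternation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a small set of samples with known ID/OOD flags for fitting the hyperplane, uses plain logistic regression as the linear separator, keeps a fixed fraction of low-loss samples, and switches detector on every epoch. All function names and parameters below are hypothetical.

```python
import math

def fit_linear_separator(features, is_ood, lr=0.5, steps=500):
    """Fit a hyperplane with logistic regression (assumed stand-in separator)."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, is_ood):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted P(OOD) minus flag
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def linear_clean_mask(features, w, b):
    """True where a sample lies on the ID side of the hyperplane."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b < 0 for x in features]

def small_loss_clean_mask(losses, keep_fraction=0.5):
    """True for the keep_fraction of samples with the smallest training loss."""
    k = int(len(losses) * keep_fraction)
    keep = set(sorted(range(len(losses)), key=lambda i: losses[i])[:k])
    return [i in keep for i in range(len(losses))]

def clean_mask_for_epoch(epoch, features, losses, w, b):
    """Assumed schedule: linear separation on even epochs, small loss on odd."""
    if epoch % 2 == 0:
        return linear_clean_mask(features, w, b)
    return small_loss_clean_mask(losses)

# Toy 2-D "contrastive features": three ID samples, then three OOD samples.
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
is_ood = [0, 0, 0, 1, 1, 1]
losses = [0.1, 2.0, 0.3, 1.5, 0.2, 0.9]  # made-up per-sample losses

w, b = fit_linear_separator(features, is_ood)
mask_epoch0 = clean_mask_for_epoch(0, features, losses, w, b)
mask_epoch1 = clean_mask_for_epoch(1, features, losses, w, b)
```

In this toy setup the two detectors disagree on several samples, which mirrors the abstract's point that the linear detector and loss-based metrics capture different sets of clean examples.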

