Transfer Learning for Fine-grained Classification Using Semi-supervised Learning and Visual Transformers

Abstract

Fine-grained classification is a challenging task that involves identifying subtle differences between objects within the same category, and it is especially difficult when data is scarce. Visual transformers (ViTs) have recently emerged as a powerful tool for image classification because they learn highly expressive representations of visual data through self-attention. In this work, we explore Semi-ViT, a ViT model fine-tuned using semi-supervised learning techniques and suited to settings where annotated data is lacking. Such settings are particularly common in e-commerce, where images are readily available but labels are noisy, nonexistent, or expensive to obtain. Our results demonstrate that Semi-ViT outperforms traditional convolutional neural networks (CNNs) and ViTs, even when fine-tuned with limited annotated data. These findings indicate that Semi-ViT holds significant promise for applications that require precise, fine-grained classification of visual data.
