Transfer Learning for Fine-grained Classification Using Semi-supervised Learning and Visual Transformers

Abstract

Fine-grained classification is a challenging task that requires identifying subtle differences between objects within the same category. The task is harder still when data is scarce. Visual transformers (ViT) have recently emerged as a powerful tool for image classification, owing to their ability to learn highly expressive representations of visual data through self-attention mechanisms. In this work, we explore Semi-ViT, a ViT model fine-tuned using semi-supervised learning techniques, which is suited to settings where annotated data is scarce. Such settings are particularly common in e-commerce, where images are readily available but labels are noisy, nonexistent, or expensive to obtain. Our results demonstrate that Semi-ViT outperforms traditional convolutional neural networks (CNNs) and ViTs, even when fine-tuned with limited annotated data. These findings indicate that Semi-ViT holds significant promise for applications that require precise, fine-grained classification of visual data.


