Emergence of Segmentation with Minimalistic White-Box Transformers

Abstract

Transformer-like models for vision tasks have recently proven effective for a wide range of downstream applications such as segmentation and detection. Previous works have shown that segmentation properties emerge in vision transformers (ViTs) trained using self-supervised methods such as DINO, but not in those trained on supervised classification tasks. In this study, we probe whether segmentation emerges in transformer-based models solely as a result of intricate self-supervised learning mechanisms, or if the same emergence can be achieved under much broader conditions through proper design of the model architecture. Through extensive experimental results, we demonstrate that when employing a white-box transformer-like architecture known as CRATE, whose design explicitly models and pursues low-dimensional structures in the data distribution, segmentation properties, at both the whole and parts levels, already emerge with a minimalistic supervised training recipe. Layer-wise finer-grained analysis reveals that the emergent properties strongly corroborate the designed mathematical functions of the white-box network. Our results suggest a path to design white-box foundation models that are simultaneously highly performant and mathematically fully interpretable. Code is available at https://github.com/Ma-Lab-Berkeley/CRATE.
