Emergence of Segmentation with Minimalistic White-Box Transformers
August 30, 2023
Authors: Yaodong Yu, Tianzhe Chu, Shengbang Tong, Ziyang Wu, Druv Pai, Sam Buchanan, Yi Ma
cs.AI
Abstract
Transformer-like models for vision tasks have recently proven effective for a
wide range of downstream applications such as segmentation and detection.
Previous works have shown that segmentation properties emerge in vision
transformers (ViTs) trained using self-supervised methods such as DINO, but not
in those trained on supervised classification tasks. In this study, we probe
whether segmentation emerges in transformer-based models solely as a result of
intricate self-supervised learning mechanisms, or if the same emergence can be
achieved under much broader conditions through proper design of the model
architecture. Through extensive experimental results, we demonstrate that when
employing a white-box transformer-like architecture known as CRATE, whose
design explicitly models and pursues low-dimensional structures in the data
distribution, segmentation properties, at both the whole and parts levels,
already emerge with a minimalistic supervised training recipe. Layer-wise
finer-grained analysis reveals that the emergent properties strongly
corroborate the designed mathematical functions of the white-box network. Our
results suggest a path to design white-box foundation models that are
simultaneously highly performant and mathematically fully interpretable. Code
is at https://github.com/Ma-Lab-Berkeley/CRATE.
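As a concrete illustration of the "white-box" design the abstract describes, below is a minimal PyTorch sketch of one CRATE-style layer: a compression step implemented as multi-head subspace self-attention (MSSA), followed by a sparsification step implemented as one ISTA iteration against a learned dictionary. This is an illustrative reconstruction from the paper's high-level description, not the authors' implementation (see the linked repository for that); the class name `CRATELayerSketch`, the hyperparameters, and the soft-thresholding variant used here are assumptions for the sketch.

```python
# Minimal sketch of a CRATE-style layer (assumed names/hyperparameters, not the
# authors' code): MSSA compression followed by one ISTA sparsification step.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CRATELayerSketch(nn.Module):
    def __init__(self, dim: int = 384, num_heads: int = 6,
                 ista_step: float = 0.1, ista_lambda: float = 0.1):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # A single projection U is shared by queries, keys, and values in MSSA,
        # unlike standard ViT attention with separate Q/K/V projections.
        self.U = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # D is a learned dictionary for the ISTA sparsification step.
        self.D = nn.Linear(dim, dim, bias=False)
        self.step = ista_step    # ISTA step size (illustrative value)
        self.lam = ista_lambda   # sparsity penalty (illustrative value)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def mssa(self, z: torch.Tensor) -> torch.Tensor:
        # Multi-head subspace self-attention: the shared projection U plays the
        # role of Q, K, and V, so attention compares tokens within each
        # per-head subspace (the "compression" step).
        B, N, C = z.shape
        u = self.U(z).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (u @ u.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn = attn.softmax(dim=-1)
        out = (attn @ u).transpose(1, 2).reshape(B, N, C)
        return self.out(out)

    def ista(self, z: torch.Tensor) -> torch.Tensor:
        # One ISTA iteration for a LASSO objective, initialized at z itself:
        # gradient step on the reconstruction residual through D, then
        # soft-thresholding to promote sparse token representations.
        Dz = self.D(z)
        grad = F.linear(Dz - z, self.D.weight.t())  # computes D^T (D z - z)
        z_new = z - self.step * grad
        return torch.sign(z_new) * F.relu(z_new.abs() - self.step * self.lam)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z = z + self.mssa(self.norm1(z))  # compression (with residual)
        z = self.ista(self.norm2(z))      # sparsification (ISTA starts from z)
        return z


if __name__ == "__main__":
    layer = CRATELayerSketch(dim=384, num_heads=6)
    tokens = torch.randn(1, 196, 384)  # e.g., 14x14 patch tokens
    print(layer(tokens).shape)         # torch.Size([1, 196, 384])
```

Because each block corresponds to one step of an explicit optimization objective (compression via MSSA, sparsification via ISTA), intermediate token representations can be inspected layer by layer, which is what makes the layer-wise analysis of emergent segmentation in the paper possible.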