Emergence of Segmentation with Minimalistic White-Box Transformers

August 30, 2023
作者: Yaodong Yu, Tianzhe Chu, Shengbang Tong, Ziyang Wu, Druv Pai, Sam Buchanan, Yi Ma
cs.AI

Abstract

Transformer-like models for vision tasks have recently proven effective for a wide range of downstream applications such as segmentation and detection. Previous works have shown that segmentation properties emerge in vision transformers (ViTs) trained using self-supervised methods such as DINO, but not in those trained on supervised classification tasks. In this study, we probe whether segmentation emerges in transformer-based models solely as a result of intricate self-supervised learning mechanisms, or if the same emergence can be achieved under much broader conditions through proper design of the model architecture. Through extensive experimental results, we demonstrate that when employing a white-box transformer-like architecture known as CRATE, whose design explicitly models and pursues low-dimensional structures in the data distribution, segmentation properties, at both the whole and parts levels, already emerge with a minimalistic supervised training recipe. Layer-wise finer-grained analysis reveals that the emergent properties strongly corroborate the designed mathematical functions of the white-box network. Our results suggest a path to design white-box foundation models that are simultaneously highly performant and mathematically fully interpretable. Code is at https://github.com/Ma-Lab-Berkeley/CRATE.
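
The abstract refers to CRATE, a white-box architecture whose layers alternate a compression step (multi-head subspace self-attention, MSSA) with a sparsification step (one iteration of ISTA against a learned dictionary). The sketch below illustrates that two-step layer structure; the tensor layout, the normalization and residual placement, the scaled softmax, and the hyperparameter names (`step_size`, `lambd`) are simplifying assumptions made for readability, not the authors' reference implementation (see the linked repository for that).

```python
# Minimal, illustrative sketch of one CRATE-style layer: an MSSA
# (compression) step followed by an ISTA (sparsification) step.
# Shapes and hyperparameters are assumptions, not the reference code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CRATEBlock(nn.Module):
    def __init__(self, dim: int, heads: int,
                 step_size: float = 0.1, lambd: float = 0.1):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.head_dim = dim // heads
        # One subspace projection per head; unlike a standard ViT block,
        # queries, keys, and values all share the same projection.
        self.U = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Dictionary D for the ISTA sparsification step.
        self.D = nn.Parameter(torch.empty(dim, dim))
        nn.init.kaiming_uniform_(self.D)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.step_size = step_size
        self.lambd = lambd

    def mssa(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, num_tokens, dim). Project tokens into K subspaces,
        # then attend using similarities of the *projected* tokens
        # (the same tensor plays query, key, and value).
        b, n, d = z.shape
        u = self.U(z).view(b, n, self.heads, self.head_dim).transpose(1, 2)
        attn = F.softmax(u @ u.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ u).transpose(1, 2).reshape(b, n, d)
        return self.out(out)

    def ista(self, z: torch.Tensor) -> torch.Tensor:
        # One ISTA step toward a nonnegative sparse code of z w.r.t. D:
        #   z <- ReLU(z + eta * D^T (z - D z) - eta * lambda)
        # (tokens are row vectors here, so D z becomes z @ D.T).
        residual = z - z @ self.D.T
        grad_step = z + self.step_size * residual @ self.D
        return F.relu(grad_step - self.step_size * self.lambd)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z = z + self.mssa(self.norm1(z))  # compression (denoising) step
        z = self.ista(self.norm2(z))      # sparsification step
        return z


# Example: a batch of 2 images, 196 patch tokens + 1 [CLS] token, dim 384.
block = CRATEBlock(dim=384, heads=6)
tokens = torch.randn(2, 197, 384)
print(block(tokens).shape)  # torch.Size([2, 197, 384])
```

Two design choices visible in the sketch set CRATE apart from a standard ViT block: each attention head uses a single shared projection rather than separate query/key/value maps, and the usual MLP is replaced by an explicit sparse-coding (ISTA) step. The paper's layer-wise analysis of the resulting attention maps and features is what ties the emergent segmentation back to these designed operations.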