Convolutional Set Transformer
September 26, 2025
Authors: Federico Chinello, Giacomo Boracchi
cs.AI
Abstract
We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics, such as a common category, scene, or concept. Existing set-input networks, e.g., Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddings before the set-input network can model inter-image relationships. In contrast, CST operates directly on 3D image tensors, performing feature extraction and contextual modeling simultaneously, thereby enabling synergies between the two processes. This design yields superior performance in tasks such as Set Classification and Set Anomaly Detection and further provides native compatibility with CNN explainability methods such as Grad-CAM, unlike competing approaches that remain opaque. Finally, we show that CSTs can be pre-trained on large-scale datasets and subsequently adapted to new domains and tasks through standard Transfer Learning schemes. To support further research, we release CST-15, a CST backbone pre-trained on ImageNet (https://github.com/chinefed/convolutional-set-transformer).
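
The abstract contrasts the conventional pipeline (encode each image with a CNN, then run a set-input network on the resulting embeddings) with CST's joint feature extraction and context modeling. The sketch below is a minimal illustration of that contrast only, not the released architecture: the class names (CascadedSetModel, CSTStyleBlock), all layer sizes, and the specific choice of cross-image attention over spatial feature maps are assumptions made for brevity, and PyTorch is used purely for illustration; the actual CST-15 layers in the linked repository may differ.

```python
# Illustrative sketch (not the authors' implementation): contrasts a cascaded
# "per-image CNN -> set-input network" pipeline with a CST-style block that
# mixes information across set members while spatial feature maps are still
# being computed. All layer sizes are arbitrary illustrative choices.
import torch
import torch.nn as nn


class CascadedSetModel(nn.Module):
    """Baseline: encode each image independently, then model the set."""

    def __init__(self, embed_dim=128, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(          # per-image CNN encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.set_attn = nn.MultiheadAttention(embed_dim, 4, batch_first=True)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                      # x: (set_size, 3, H, W)
        z = self.encoder(x).unsqueeze(0)       # (1, set_size, embed_dim)
        z, _ = self.set_attn(z, z, z)          # inter-image relations, post hoc
        return self.head(z.squeeze(0))         # per-image logits


class CSTStyleBlock(nn.Module):
    """CST-flavoured block: convolution and cross-image attention interleaved,
    so set context shapes the features as they are being extracted."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.MultiheadAttention(channels, 4, batch_first=True)
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x):                      # x: (set_size, C, H, W)
        x = torch.relu(self.norm(self.conv(x)))
        n, c, h, w = x.shape
        # Attend across set members at each spatial location: tokens are the
        # per-image feature vectors, batched over the H*W positions.
        tokens = x.permute(2, 3, 0, 1).reshape(h * w, n, c)
        ctx, _ = self.attn(tokens, tokens, tokens)
        ctx = ctx.reshape(h, w, n, c).permute(2, 3, 0, 1)
        return x + ctx                         # residual context injection


if __name__ == "__main__":
    images = torch.randn(5, 3, 64, 64)            # a set of 5 images
    print(CascadedSetModel()(images).shape)       # torch.Size([5, 10])
    stem = nn.Conv2d(3, 32, 3, stride=2, padding=1)
    print(CSTStyleBlock(32)(stem(images)).shape)  # torch.Size([5, 32, 32, 32])
```

The intended point of the CST-style block is that inter-image attention acts on intermediate feature maps, so subsequent convolutions already see set context, whereas in the cascaded baseline context is injected only after each image has been fully encoded into a single embedding.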