Convolutional Set Transformer
September 26, 2025
Authors: Federico Chinello, Giacomo Boracchi
cs.AI
Abstract
We introduce the Convolutional Set Transformer (CST), a novel neural
architecture designed to process image sets of arbitrary cardinality that are
visually heterogeneous yet share high-level semantics, such as a common
category, scene, or concept. Existing set-input networks, e.g., Deep Sets and
Set Transformer, are limited to vector inputs and cannot directly handle 3D
image tensors. As a result, they must be cascaded with a feature extractor,
typically a CNN, which encodes images into embeddings before the set-input
network can model inter-image relationships. In contrast, CST operates directly
on 3D image tensors, performing feature extraction and contextual modeling
simultaneously, thereby enabling synergies between the two processes. This
design yields superior performance in tasks such as Set Classification and Set
Anomaly Detection and further provides native compatibility with CNN
explainability methods such as Grad-CAM, unlike competing approaches that
remain opaque. Finally, we show that CSTs can be pre-trained on large-scale
datasets and subsequently adapted to new domains and tasks through standard
Transfer Learning schemes. To support further research, we release CST-15, a
CST backbone pre-trained on ImageNet
(https://github.com/chinefed/convolutional-set-transformer).
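The abstract's central contrast, cascading a CNN with a set-input network versus interleaving contextual modeling directly with feature extraction, can be illustrated with a minimal sketch. Note this is an illustrative assumption, not the authors' architecture: the pooling scheme, the projection matrices `Wq`/`Wk`/`Wv`, and the residual broadcast below are hypothetical stand-ins for the CST layers defined in the paper and released code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def set_context_block(feats, Wq, Wk, Wv):
    """Inject set-level context into per-image feature maps (illustrative).

    feats: (N, H, W, C) feature maps for a set of N images.
    Each map is pooled to a descriptor, attention runs across the
    set dimension (so any cardinality N works), and the resulting
    context is broadcast back onto every spatial location.
    """
    pooled = feats.mean(axis=(1, 2))                          # (N, C)
    q, k, v = pooled @ Wq, pooled @ Wk, pooled @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)   # (N, N)
    context = attn @ v                                        # (N, C)
    return feats + context[:, None, None, :]                  # residual broadcast

rng = np.random.default_rng(0)
C = 8
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
for n in (3, 7):  # the same block handles sets of arbitrary cardinality
    out = set_context_block(rng.standard_normal((n, 16, 16, C)), Wq, Wk, Wv)
    print(out.shape)
```

Stacking such context blocks between convolutional stages, rather than only after a final embedding, is what lets set-level information influence intermediate feature extraction; it also keeps the computation convolutional, which is why Grad-CAM-style explainability remains applicable.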