Convolutional Set Transformer
September 26, 2025
Authors: Federico Chinello, Giacomo Boracchi
cs.AI
Abstract
We introduce the Convolutional Set Transformer (CST), a novel neural
architecture designed to process image sets of arbitrary cardinality that are
visually heterogeneous yet share high-level semantics, such as a common
category, scene, or concept. Existing set-input networks, e.g., Deep Sets and
Set Transformer, are limited to vector inputs and cannot directly handle 3D
image tensors. As a result, they must be cascaded with a feature extractor,
typically a CNN, which encodes images into embeddings before the set-input
network can model inter-image relationships. In contrast, CST operates directly
on 3D image tensors, performing feature extraction and contextual modeling
simultaneously, thereby enabling synergies between the two processes. This
design yields superior performance in tasks such as Set Classification and Set
Anomaly Detection and further provides native compatibility with CNN
explainability methods such as Grad-CAM, unlike competing approaches that
remain opaque. Finally, we show that CSTs can be pre-trained on large-scale
datasets and subsequently adapted to new domains and tasks through standard
Transfer Learning schemes. To support further research, we release CST-15, a
CST backbone pre-trained on ImageNet
(https://github.com/chinefed/convolutional-set-transformer).
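The abstract's central contrast, cascading a CNN with a set-input network versus interleaving contextual modeling directly with feature extraction, can be illustrated with a minimal sketch. Note this is an illustrative assumption, not the authors' architecture: the pooling scheme, the projection matrices `Wq`/`Wk`/`Wv`, and the residual broadcast below are hypothetical stand-ins for the CST layers defined in the paper and released code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def set_context_block(feats, Wq, Wk, Wv):
    """Inject set-level context into per-image feature maps (illustrative).

    feats: (N, H, W, C) feature maps for a set of N images.
    Each map is pooled to a descriptor, attention runs across the
    set dimension (so any cardinality N works), and the resulting
    context is broadcast back onto every spatial location.
    """
    pooled = feats.mean(axis=(1, 2))                          # (N, C)
    q, k, v = pooled @ Wq, pooled @ Wk, pooled @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)   # (N, N)
    context = attn @ v                                        # (N, C)
    return feats + context[:, None, None, :]                  # residual broadcast

rng = np.random.default_rng(0)
C = 8
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
for n in (3, 7):  # the same block handles sets of arbitrary cardinality
    out = set_context_block(rng.standard_normal((n, 16, 16, C)), Wq, Wk, Wv)
    print(out.shape)
```

Stacking such context blocks between convolutional stages, rather than only after a final embedding, is what lets set-level information influence intermediate feature extraction; it also keeps the computation convolutional, which is why Grad-CAM-style explainability remains applicable.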