Convolutional Set Transformer

Abstract

We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics, such as a common category, scene, or concept. Existing set-input networks, such as Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddings before the set-input network can model inter-image relationships. In contrast, CST operates directly on 3D image tensors and performs feature extraction and contextual modeling simultaneously, enabling synergies between the two processes. This design yields superior performance in tasks such as Set Classification and Set Anomaly Detection. It also provides native compatibility with CNN explainability methods such as Grad-CAM, unlike competing approaches that remain opaque. Finally, we show that CSTs can be pre-trained on large-scale datasets and subsequently adapted to new domains and tasks through standard Transfer Learning schemes. To support further research, we release CST-15, a CST backbone pre-trained on ImageNet (https://github.com/chinefed/convolutional-set-transformer).
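
For illustration only, the cascaded baseline that the abstract contrasts with CST (a per-image encoder followed by a set-level model) can be sketched as below. This is a toy sketch under stated assumptions, not the CST architecture and not the reference Deep Sets or Set Transformer implementation: the encoder is a plain global-average-pooling stand-in for a CNN, the set-level step is simple mean pooling, and the names `encode_image`, `pool_set`, and `cascaded_set_embedding` are hypothetical.

```python
from typing import List

# A toy 3D image tensor stored as nested lists with shape H x W x C.
Image = List[List[List[float]]]


def encode_image(image: Image) -> List[float]:
    """Hypothetical stand-in for a CNN encoder: per-channel global average pooling."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    sums = [0.0] * channels
    for row in image:
        for pixel in row:
            for c, value in enumerate(pixel):
                sums[c] += value
    return [s / (height * width) for s in sums]


def pool_set(embeddings: List[List[float]]) -> List[float]:
    """Deep Sets-style mean pooling: order-independent and valid for any set size."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def cascaded_set_embedding(images: List[Image]) -> List[float]:
    # Stage 1: every image is encoded in isolation, with no inter-image context.
    embeddings = [encode_image(img) for img in images]
    # Stage 2: the set-level step sees only the vectors, never the pixels.
    return pool_set(embeddings)
```

In this sketch, inter-image relationships can only be modeled after each image has been collapsed to a vector. That is the limitation the abstract attributes to cascaded pipelines, and the one CST is described as avoiding by operating on the 3D tensors directly.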