
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

January 25, 2024
Authors: Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue
cs.AI

Abstract

We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway - given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models so that data of the target modality can be processed by both models. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and task-specific head as usual but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any inference costs. On the image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT.
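The abstract does not spell out how Cross-Modal Re-parameterization is implemented. Below is a minimal sketch under one plausible reading: each linear layer in a target transformer block uses an effective weight equal to its own weight plus a learnable scalar times a frozen weight taken from the auxiliary model, and the two are merged into a single matrix after training so inference incurs no extra cost. The class name `CrossModalLinear`, the `scale` parameter, and the `merge` helper are illustrative names, not identifiers from the M2PT repository.

```python
# Minimal sketch of cross-modal re-parameterization for one linear layer.
# Assumption (not stated in the abstract): effective weight = W_target + scale * W_aux,
# where W_aux is frozen and comes from a transformer trained on another modality.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalLinear(nn.Module):
    def __init__(self, target_linear: nn.Linear, aux_weight: torch.Tensor):
        super().__init__()
        self.linear = target_linear
        # Frozen auxiliary weight; stored as a buffer so it is not updated by the optimizer.
        self.register_buffer("aux_weight", aux_weight.clone())
        # Learnable scalar controlling the contribution of the auxiliary weight.
        self.scale = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Combine the two weights on the fly during training.
        weight = self.linear.weight + self.scale * self.aux_weight
        return F.linear(x, weight, self.linear.bias)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # Fold the auxiliary branch into a plain linear layer for deployment,
        # so inference uses no extra parameters or FLOPs.
        merged = nn.Linear(
            self.linear.in_features,
            self.linear.out_features,
            bias=self.linear.bias is not None,
        )
        merged.weight.copy_(self.linear.weight + self.scale * self.aux_weight)
        if self.linear.bias is not None:
            merged.bias.copy_(self.linear.bias)
        return merged
```

In this reading, wrapping the linear layers of the target transformer blocks this way lets data of the target modality be processed by weights learned from both modalities during training, while `merge` recovers an ordinary linear layer at inference time, matching the abstract's claim of exploiting auxiliary weights without any inference cost.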