Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
January 25, 2024
Authors: Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue
cs.AI
Abstract
We propose to improve transformers of a specific modality with irrelevant
data from other modalities, e.g., improve an ImageNet model with audio or point
cloud datasets. We would like to highlight that the data samples of the target
modality are irrelevant to the other modalities, which distinguishes our method
from other works utilizing paired (e.g., CLIP) or interleaved data of different
modalities. We propose a methodology named Multimodal Pathway - given a target
modality and a transformer designed for it, we use an auxiliary transformer
trained with data of another modality and construct pathways to connect
components of the two models so that data of the target modality can be
processed by both models. In this way, we utilize the universal
sequence-to-sequence modeling abilities of transformers obtained from two
modalities. As a concrete implementation, we use a modality-specific tokenizer
and task-specific head as usual but utilize the transformer blocks of the
auxiliary model via a proposed method named Cross-Modal Re-parameterization,
which exploits the auxiliary weights without any inference costs. On the image,
point cloud, video, and audio recognition tasks, we observe significant and
consistent performance improvements with irrelevant data from other modalities.
The code and models are available at https://github.com/AILab-CVC/M2PT.
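The core mechanism described above, Cross-Modal Re-parameterization, can be sketched as follows. This is an illustrative toy version, not the authors' implementation: it assumes each linear layer of the target transformer combines its own weight with a frozen auxiliary-modality weight through a learnable scalar, and that the combination is merged into a single matrix after training, which is how the auxiliary weights add zero inference cost.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4

# Target-modality weight (trainable) and auxiliary transformer weight
# (taken from a model trained on another modality, kept frozen).
W = rng.standard_normal((d_in, d_out))
W_aux = rng.standard_normal((d_in, d_out))
lam = 0.1  # learnable scalar balancing the two weights (illustrative)

def forward_training(x):
    # During training, the two weights are combined on the fly,
    # so gradients can flow into W and lam.
    return x @ (W + lam * W_aux)

# After training, merge once into a single matrix: inference then runs
# as a plain linear layer, with no extra parameters or compute.
W_merged = W + lam * W_aux

def forward_inference(x):
    return x @ W_merged

x = rng.standard_normal((2, d_in))
assert np.allclose(forward_training(x), forward_inference(x))
```

The merge step is what makes the method free at inference time: the re-parameterized layer is numerically identical to an ordinary linear layer of the same shape.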