Meta-Transformer: A Unified Framework for Multimodal Learning
July 20, 2023
Authors: Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue
cs.AI
Abstract
Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it remains challenging to design a unified network for processing various modalities (e.g., natural language, 2D images, 3D point clouds, audio, video, time series, and tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks, Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks, including fundamental perception (text, image, point cloud, audio, video), practical applications (X-ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time series). Meta-Transformer points to a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer.
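To make the pipeline concrete, below is a minimal PyTorch sketch of the three components the abstract describes: a per-modality tokenizer that maps raw input into the shared token space, a frozen modality-shared Transformer encoder, and a lightweight task-specific head. All class and parameter names here (ImageTokenizer, embed_dim, etc.) are illustrative assumptions rather than the repository's actual API, and the encoder is randomly initialized to keep the sketch self-contained, whereas the paper uses a pretrained one.

```python
# Minimal sketch of the Meta-Transformer pipeline described above.
# Names are illustrative assumptions, not the official API from
# https://github.com/invictus717/MetaTransformer.
import torch
import torch.nn as nn


class ImageTokenizer(nn.Module):
    """Maps raw 2D images into the shared token space via patch embedding."""

    def __init__(self, embed_dim=768, patch_size=16, in_channels=3):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                     # (B, 3, H, W)
        tokens = self.proj(images)                 # (B, D, H/P, W/P)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, D)


class MetaTransformer(nn.Module):
    """Tokenizer -> frozen modality-shared encoder -> task-specific head."""

    def __init__(self, tokenizer, num_classes, embed_dim=768,
                 depth=12, num_heads=12):
        super().__init__()
        self.tokenizer = tokenizer
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        # In the paper the shared encoder is pretrained; it is randomly
        # initialized here only to keep the sketch runnable.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.encoder.parameters():
            p.requires_grad = False                # freeze the shared encoder
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.tokenizer(x)                 # shared token space
        feats = self.encoder(tokens)               # frozen weights; gradients
                                                   # still reach the tokenizer
        return self.head(feats.mean(dim=1))        # mean-pool, then classify


model = MetaTransformer(ImageTokenizer(), num_classes=1000)
logits = model(torch.randn(2, 3, 224, 224))        # -> shape (2, 1000)
```

Note that the encoder is frozen by disabling its parameter gradients rather than wrapping the forward pass in torch.no_grad(), so gradients still flow through it to train the modality-specific tokenizer and task head, which mirrors the training setup the abstract implies.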