Meta-Transformer: A Unified Framework for Multimodal Learning
July 20, 2023
Authors: Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue
cs.AI
Abstract
Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it remains challenging to design a unified network for processing various modalities (e.g., natural language, 2D images, 3D point clouds, audio, video, time series, and tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks, Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks, including fundamental perception (text, image, point cloud, audio, video), practical applications (X-ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time series). Meta-Transformer points to a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer.
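To make the pipeline concrete, below is a minimal PyTorch sketch of the three components the abstract describes: a per-modality tokenizer that maps raw input into the shared token space, a frozen modality-shared Transformer encoder, and a lightweight task-specific head. All class and parameter names here (ImageTokenizer, embed_dim, etc.) are illustrative assumptions rather than the repository's actual API, and the encoder is randomly initialized to keep the sketch self-contained, whereas the paper uses a pretrained one.

```python
# Minimal sketch of the Meta-Transformer pipeline described above.
# Names are illustrative assumptions, not the official API from
# https://github.com/invictus717/MetaTransformer.
import torch
import torch.nn as nn


class ImageTokenizer(nn.Module):
    """Maps raw 2D images into the shared token space via patch embedding."""

    def __init__(self, embed_dim=768, patch_size=16, in_channels=3):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                     # (B, 3, H, W)
        tokens = self.proj(images)                 # (B, D, H/P, W/P)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, D)


class MetaTransformer(nn.Module):
    """Tokenizer -> frozen modality-shared encoder -> task-specific head."""

    def __init__(self, tokenizer, num_classes, embed_dim=768,
                 depth=12, num_heads=12):
        super().__init__()
        self.tokenizer = tokenizer
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        # In the paper the shared encoder is pretrained; it is randomly
        # initialized here only to keep the sketch runnable.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.encoder.parameters():
            p.requires_grad = False                # freeze the shared encoder
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.tokenizer(x)                 # shared token space
        feats = self.encoder(tokens)               # frozen weights; gradients
                                                   # still reach the tokenizer
        return self.head(feats.mean(dim=1))        # mean-pool, then classify


model = MetaTransformer(ImageTokenizer(), num_classes=1000)
logits = model(torch.randn(2, 3, 224, 224))        # -> shape (2, 1000)
```

Note that the encoder is frozen by disabling its parameter gradients rather than wrapping the forward pass in torch.no_grad(), so gradients still flow through it to train the modality-specific tokenizer and task head, which mirrors the training setup the abstract implies.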