Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

August 24, 2025
Authors: Hugo Bohy, Minh Tran, Kevin El Haddad, Thierry Dutoit, Mohammad Soleymani
cs.AI

Abstract

Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, a pre-trained audiovisual Masked Autoencoder based on an extended version of the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interactions (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by fine-tuning and evaluating it on several social and affective downstream tasks, namely emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition, and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
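To make the CAV-MAE-style objective described above more concrete, the sketch below shows a toy joint audio-visual masked autoencoder trained with a reconstruction loss on masked patches plus a contrastive audio-visual loss. This is an illustrative sketch only, not the Social-MAE or CAV-MAE code: all class names, dimensions, masking ratios, and the omission of positional embeddings are hypothetical simplifications.

```python
# Toy sketch of a joint audio-visual masked autoencoder (reconstruction of
# masked patches + batch-wise audio-visual contrastive loss). Hypothetical
# names and sizes; positional embeddings omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F


def random_masking(x, mask_ratio):
    """Keep a random subset of tokens per sample. Returns visible tokens,
    a binary mask over all positions (1 = masked), and restore indices."""
    B, N, D = x.shape
    n_keep = int(N * (1 - mask_ratio))
    ids_shuffle = torch.rand(B, N, device=x.device).argsort(dim=1)
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    x_vis = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=x.device)
    mask[:, :n_keep] = 0
    return x_vis, torch.gather(mask, 1, ids_restore), ids_restore


class ToyAVMAE(nn.Module):
    def __init__(self, patch_dim_a=256, patch_dim_v=768, dim=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed_a = nn.Linear(patch_dim_a, dim)   # audio spectrogram patches
        self.embed_v = nn.Linear(patch_dim_v, dim)   # video frame patches
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)   # joint encoder
        dec = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head_a = nn.Linear(dim, patch_dim_a)
        self.head_v = nn.Linear(dim, patch_dim_v)

    def _reconstruct(self, enc_tokens, ids_restore, head, target, mask):
        """Pad encoded visible tokens with mask tokens, restore original order,
        decode, and compute MSE only on the masked positions."""
        B, N = ids_restore.shape
        n_missing = N - enc_tokens.size(1)
        tokens = torch.cat(
            [enc_tokens, self.mask_token.expand(B, n_missing, -1)], dim=1)
        tokens = torch.gather(
            tokens, 1, ids_restore.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        pred = head(self.decoder(tokens))
        per_patch = ((pred - target) ** 2).mean(dim=-1)
        return (per_patch * mask).sum() / mask.sum().clamp(min=1)

    def forward(self, audio_patches, video_patches):
        a_vis, a_mask, a_restore = random_masking(
            self.embed_a(audio_patches), self.mask_ratio)
        v_vis, v_mask, v_restore = random_masking(
            self.embed_v(video_patches), self.mask_ratio)
        # Jointly encode the visible audio and video tokens.
        joint = self.encoder(torch.cat([a_vis, v_vis], dim=1))
        a_enc, v_enc = joint[:, :a_vis.size(1)], joint[:, a_vis.size(1):]

        # Contrastive term: pull matching audio/video clips together in the batch.
        za = F.normalize(a_enc.mean(dim=1), dim=-1)
        zv = F.normalize(v_enc.mean(dim=1), dim=-1)
        logits = za @ zv.t() / 0.07
        labels = torch.arange(logits.size(0), device=logits.device)
        loss_c = 0.5 * (F.cross_entropy(logits, labels)
                        + F.cross_entropy(logits.t(), labels))

        # Reconstruction term on the masked patches of each modality.
        loss_r = (self._reconstruct(a_enc, a_restore, self.head_a, audio_patches, a_mask)
                  + self._reconstruct(v_enc, v_restore, self.head_v, video_patches, v_mask))
        return loss_r + loss_c


# Example usage with random "patchified" inputs (batch of 4 clips).
model = ToyAVMAE()
audio = torch.randn(4, 64, 256)    # 64 spectrogram patches per clip
video = torch.randn(4, 196, 768)   # 196 frame patches per clip
loss = model(audio, video)
loss.backward()
```

For downstream use as described in the abstract, the decoder would be discarded and a small classification head placed on the pooled encoder output, with the whole model fine-tuned on the target task (emotion recognition, laughter detection, or apparent personality estimation).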