Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice
August 24, 2025
Authors: Hugo Bohy, Minh Tran, Kevin El Haddad, Thierry Dutoit, Mohammad Soleymani
cs.AI
Abstract
Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, a pre-trained audiovisual Masked Autoencoder based on an extended version of the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by fine-tuning and evaluating it on several social and affective downstream tasks, namely emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition, and competitive results on apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
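
The abstract's recipe is: patch-embed both modalities, fuse them with a joint Transformer encoder pre-trained in a self-supervised manner, then attach a small head and fine-tune on a labelled downstream task. The PyTorch sketch below is purely illustrative and is not the released Social-MAE implementation; the class names (`AudioVisualEncoder`, `DownstreamClassifier`), the checkpoint filename, and all hyperparameters are assumptions chosen only to make the fine-tuning idea concrete.

```python
# Minimal illustrative sketch (not the authors' code) of fine-tuning a CAV-MAE-style
# audiovisual encoder on a downstream classification task such as emotion recognition.
import torch
import torch.nn as nn


class AudioVisualEncoder(nn.Module):
    """Toy stand-in for a pre-trained CAV-MAE-style encoder: patch-embeds a log-mel
    spectrogram and a stack of video frames, then fuses them with a joint Transformer."""

    def __init__(self, embed_dim=768, depth=2, num_heads=8):
        super().__init__()
        # Audio branch: spectrogram split into 16x16 patches.
        self.audio_patch = nn.Conv2d(1, embed_dim, kernel_size=16, stride=16)
        # Video branch: each RGB frame split into 16x16 patches.
        self.video_patch = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.joint_encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

    def forward(self, audio, video):
        # audio: (B, 1, mel_bins, time); video: (B, T, 3, H, W) with T input frames
        a = self.audio_patch(audio).flatten(2).transpose(1, 2)        # (B, Na, D)
        b, t, c, h, w = video.shape
        v = self.video_patch(video.reshape(b * t, c, h, w))
        v = v.flatten(2).transpose(1, 2).reshape(b, -1, v.shape[-1])  # (B, T*Nv, D)
        tokens = torch.cat([a, v], dim=1)                             # joint audiovisual sequence
        return self.joint_encoder(tokens).mean(dim=1)                 # pooled representation


class DownstreamClassifier(nn.Module):
    """Pre-trained encoder plus a linear head, fine-tuned end-to-end on labelled data."""

    def __init__(self, encoder, embed_dim=768, num_classes=6):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, audio, video):
        return self.head(self.encoder(audio, video))


if __name__ == "__main__":
    encoder = AudioVisualEncoder()
    # In practice, self-supervised weights would be loaded before fine-tuning, e.g.
    # encoder.load_state_dict(torch.load("social_mae_pretrained.pt"))  # hypothetical checkpoint
    model = DownstreamClassifier(encoder, num_classes=6)  # e.g. 6 emotion categories
    audio = torch.randn(2, 1, 128, 1024)    # batch of log-mel spectrograms
    video = torch.randn(2, 8, 3, 224, 224)  # batch of 8-frame face clips
    logits = model(audio, video)
    print(logits.shape)  # torch.Size([2, 6])
```

The 8-frame video input above reflects the paper's stated modification of CAV-MAE to accept more frames; the actual frame count, patch sizes, and pooling strategy used by Social-MAE should be taken from the released repository rather than this sketch.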