Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

Abstract

Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by fine-tuning and evaluating it on different social and affective downstream tasks, namely emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter detection and competitive results on apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
