Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

Abstract

Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by fine-tuning and evaluating it on different social and affective downstream tasks, namely emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter detection and competitive results on apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
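
To make the masked-autoencoder idea concrete, the following is a minimal sketch of MAE-style random masking over the patch tokens of a multi-frame visual input. It is not the authors' implementation: the function names, the frame count, image size, patch size, and mask ratio are all hypothetical values chosen for illustration, and the abstract does not state any of them.

```python
import random


def video_token_count(num_frames, image_size, patch_size):
    """Number of patch tokens when each frame is cut into square patches."""
    patches_per_side = image_size // patch_size
    return num_frames * patches_per_side * patches_per_side


def random_mask(num_tokens, mask_ratio, seed=None):
    """Randomly split token indices into (visible, masked) lists.

    In a masked autoencoder, only the visible tokens are fed to the encoder;
    the decoder is trained to reconstruct the masked ones.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Hypothetical settings (assumptions, not taken from the paper).
NUM_FRAMES = 8      # more frames means proportionally more tokens
IMAGE_SIZE = 224
PATCH_SIZE = 16
MASK_RATIO = 0.75

num_tokens = video_token_count(NUM_FRAMES, IMAGE_SIZE, PATCH_SIZE)
visible, masked = random_mask(num_tokens, MASK_RATIO, seed=0)
print(f"tokens={num_tokens} visible={len(visible)} masked={len(masked)}")
```

The sketch shows why accepting more frames matters for such a model: the token count grows linearly with the number of frames, so with a high mask ratio the encoder still processes only a small visible subset.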
