MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition

Abstract

In this paper, we study a novel problem in egocentric action recognition, which we term "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security and efficiency considerations in real-world applications: (1) missing modality generalization, where some modalities that were present at training time are missing at inference time, and (2) cross-modal zero-shot generalization, where the modalities present at inference time and at training time are disjoint. To enable this investigation, we construct a new dataset, MMG-Ego4D, containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from the Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research on the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. The benchmark and code will be available at https://github.com/facebookresearch/MMG_Ego4D.
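The two MMG scenarios can be read as set relations between the modalities seen at training time and those available at inference time. The following is a minimal sketch of that reading, not code from the MMG_Ego4D repository: the function name `classify_scenario`, the label strings, and the extra "standard" and "other" cases are assumptions introduced here for illustration.

```python
from typing import Iterable

# The three modalities named in the abstract.
MODALITIES = frozenset({"video", "audio", "imu"})


def classify_scenario(train: Iterable[str], inference: Iterable[str]) -> str:
    """Label a (train, inference) modality pair with an MMG scenario.

    Hypothetical helper: the label strings are illustrative only.
    """
    train_set, infer_set = frozenset(train), frozenset(inference)
    if not train_set or not infer_set:
        raise ValueError("both modality sets must be non-empty")
    unknown = (train_set | infer_set) - MODALITIES
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")

    if infer_set == train_set:
        # Same modalities at both stages: ordinary multimodal evaluation.
        return "standard"
    if infer_set < train_set:
        # Scenario (1): some training modalities are missing at inference.
        return "missing_modality"
    if infer_set.isdisjoint(train_set):
        # Scenario (2): no modality is shared between the two stages.
        return "cross_modal_zero_shot"
    # Partial overlap with a new modality at inference; not covered
    # by either scenario described in the abstract.
    return "other"
```

For example, training on video, audio, and IMU and testing on video alone falls under scenario (1), while training on video and testing on audio and IMU falls under scenario (2).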
