Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Abstract

One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title or a description. Furthermore, video and audio inputs are of much larger volume and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. Here, we decouple the multimodal modeling, dividing it into separate, focused autoregressive models that process the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video) and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we propose to further partition the video and audio sequences into consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features, producing compact but expressive representations per snippet. Our approach achieves state-of-the-art results on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
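
The snippet partitioning and per-snippet fusion described in the abstract can be illustrated with a minimal sketch. It is not the model's implementation: the Combiner in the abstract is learned, whereas the `toy_combiner` below is a fixed mean-pooling stand-in. All function names, the parameter `m` (output vectors per snippet), and the assumption that video and audio features share one dimension are hypothetical and chosen only for illustration. The autoregressive models that consume the resulting sequence are omitted.

```python
from typing import List, Sequence

Vector = List[float]


def partition(frames: Sequence[Vector], num_snippets: int) -> List[List[Vector]]:
    """Split a time-ordered sequence into consecutive, equal-length snippets."""
    if num_snippets <= 0 or len(frames) % num_snippets != 0:
        raise ValueError("num_snippets must be positive and divide the sequence length")
    size = len(frames) // num_snippets
    return [list(frames[i * size:(i + 1) * size]) for i in range(num_snippets)]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def toy_combiner(video: Sequence[Vector], audio: Sequence[Vector], m: int) -> List[Vector]:
    """Fuse one snippet's video and audio features into m output vectors.

    Hypothetical stand-in for a learned Combiner: the two modalities are
    concatenated along time, split into m chunks, and each chunk is mean-pooled.
    """
    joint = list(video) + list(audio)
    if m <= 0 or len(joint) % m != 0:
        raise ValueError("m must be positive and divide the joint snippet length")
    size = len(joint) // m
    return [mean_pool(joint[i * size:(i + 1) * size]) for i in range(m)]


def compact_sequence(
    video: Sequence[Vector], audio: Sequence[Vector], num_snippets: int, m: int
) -> List[Vector]:
    """Produce num_snippets * m fused vectors, in temporal order."""
    fused: List[Vector] = []
    video_snippets = partition(video, num_snippets)
    audio_snippets = partition(audio, num_snippets)
    for v, a in zip(video_snippets, audio_snippets):
        fused.extend(toy_combiner(v, a, m))
    return fused


# Toy inputs: 8 video feature vectors and 4 audio feature vectors, both 2-dimensional.
video = [[float(t), 0.0] for t in range(8)]
audio = [[0.0, float(t)] for t in range(4)]
fused = compact_sequence(video, audio, num_snippets=4, m=1)
```

In this sketch the output length is `num_snippets * m` regardless of how many raw feature vectors each snippet holds, which is the sense in which the per-snippet fusion step controls the sequence length passed on to later stages.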
