Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

Abstract

To promote the development of Vision-Language Pre-training (VLP) and multimodal Large Language Models (LLMs) in the Chinese community, we release Youku-mPLUG, the largest public high-quality Chinese video-language dataset. It is collected from Youku, a well-known Chinese video-sharing website, under strict criteria for safety, diversity, and quality. Youku-mPLUG contains 10 million Chinese video-text pairs, filtered from 400 million raw videos spanning 45 diverse categories, for large-scale pre-training. In addition, to facilitate comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks, covering three popular video-language tasks: cross-modal retrieval, video captioning, and video category classification. Youku-mPLUG can enable researchers to conduct more in-depth multimodal research and develop better applications in the future. Furthermore, we release the popular video-language pre-training models ALPRO and mPLUG-2, as well as our proposed modularized decoder-only model mPLUG-video, pre-trained on Youku-mPLUG. Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1% improvement in video category classification. In addition, mPLUG-video achieves new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy in video category classification and a CIDEr score of 68.9 in video captioning. Finally, we scale up mPLUG-video on top of a frozen Bloomz, with only 1.7% trainable parameters, as a Chinese multimodal LLM, and demonstrate impressive instruction and video understanding ability. The zero-shot instruction understanding experiment indicates that pre-training with Youku-mPLUG can enhance the ability to comprehend overall and detailed visual semantics, recognize scene text, and leverage open-domain knowledge.
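For readers unfamiliar with the classification metric quoted above, the following is a minimal sketch of how top-1 accuracy is commonly computed. It is not the authors' evaluation code: the function name `top1_accuracy`, the toy scores, and the three-category setup are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of samples whose highest-scoring class equals the true label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the model's top-1 prediction.
        predicted = max(range(len(row)), key=lambda i: row[i])
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Toy data (hypothetical, not from Youku-mPLUG): 4 videos, 3 categories.
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.2, 0.6],
]
toy_labels = [0, 2, 0, 2]

accuracy = top1_accuracy(toy_scores, toy_labels)
print(f"top-1 accuracy: {accuracy:.2%}")
```

In this toy case three of the four predictions match their labels, so the sketch reports 75%. The 80.5% reported in the abstract is the same kind of ratio, computed over the benchmark's video category classification test set.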
