HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models

Abstract

Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions remains limited by a lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to collect videos featuring clear human actions from the Internet. Second, we annotate the videos in a standardized caption format that uses human attributes to distinguish individuals and details their actions and interactions in chronological order. Through this pipeline, we curate two datasets, HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. HAICBench includes 500 manually annotated video-caption pairs and 1,400 QA pairs for a comprehensive evaluation of human action understanding. Experimental results show that training with HAICTrain not only significantly enhances human understanding abilities across 4 benchmarks, but can also improve text-to-video generation results. Both HAICTrain and HAICBench are released at https://huggingface.co/datasets/KuaishouHAIC/HAIC.
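
As a rough illustration of the standardized caption format described in the abstract (individuals distinguished by their attributes, with actions and interactions listed in chronological order), the following is a minimal sketch. The class names, field names, and the rendered sentence template are assumptions made for illustration only; they are not the actual HAIC schema or annotation tooling.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    # Hypothetical field: visible attributes used to tell individuals apart.
    attributes: str


@dataclass
class Event:
    order: int                    # position in the timeline
    actor: int                    # index into Caption.people
    action: str                   # what the actor does
    target: Optional[int] = None  # index of another person, for interactions


@dataclass
class Caption:
    people: List[Person]
    events: List[Event] = field(default_factory=list)

    def render(self) -> str:
        """Render events in chronological order, naming people by attributes."""
        def name(i: int) -> str:
            return f"the {self.people[i].attributes}"

        parts = []
        for ev in sorted(self.events, key=lambda e: e.order):
            text = f"{name(ev.actor)} {ev.action}"
            if ev.target is not None:
                text += f" {name(ev.target)}"
            parts.append(text)
        body = ", then ".join(parts)
        return body[:1].upper() + body[1:] + "."


# Illustrative data only; not taken from HAICTrain or HAICBench.
caption = Caption(
    people=[Person("woman in a red coat"), Person("man in a gray hoodie")],
    events=[
        Event(order=2, actor=1, action="waves at", target=0),
        Event(order=1, actor=0, action="picks up a box"),
    ],
)
print(caption.render())
```

Sorting by `order` before rendering keeps the description chronological even when events are recorded out of sequence, and referring to each person by attributes rather than by an index keeps the text unambiguous when several individuals appear in the same video.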