LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Abstract

Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training of Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves ASR words and video frames according to their timestamps. Compared to previous studies of vision-language representation learning with ASR, our method naturally fits the streaming characteristics of ASR, enabling the model to learn temporally aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline that processes YouTube videos and their closed captions (CC, same as ASR), yielding the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability: real-time video commentary. To evaluate this capability, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to measure free-form commentary. Experiments show that our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality, even when operating in real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach. All resources of this paper have been released at https://showlab.github.io/livecc.
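The timestamp-based interleaving mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Word` and `Frame` types, the `interleave` and `render` helpers, the fixed frame rate, the `<frame_i>` placeholder tokens, and the rule that each frame is followed by the words starting before the next frame are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # ASR word start time in seconds


@dataclass(frozen=True)
class Frame:
    index: int
    time: float  # frame timestamp in seconds


def interleave(words: List[Word], duration: float, fps: float = 2.0) -> List[Union[Frame, Word]]:
    """Build one sequence in which each frame is followed by the words
    whose start time falls before the next frame (hypothetical rule)."""
    ordered = sorted(words, key=lambda w: w.start)
    n_frames = int(duration * fps)
    sequence: List[Union[Frame, Word]] = []
    cursor = 0
    for i in range(n_frames):
        window_end = (i + 1) / fps
        sequence.append(Frame(index=i, time=i / fps))
        # Attach every not-yet-used word that starts inside this frame's window.
        while cursor < len(ordered) and ordered[cursor].start < window_end:
            sequence.append(ordered[cursor])
            cursor += 1
    return sequence  # words starting at or after `duration` are dropped


def render(sequence: List[Union[Frame, Word]]) -> str:
    """Flatten the sequence to text, using placeholder tokens for frames."""
    parts = [f"<frame_{x.index}>" if isinstance(x, Frame) else x.text for x in sequence]
    return " ".join(parts)


if __name__ == "__main__":
    asr = [Word("he", 0.1), Word("shoots", 0.6), Word("and", 0.7), Word("scores", 1.2)]
    print(render(interleave(asr, duration=1.5, fps=2.0)))
```

In a real training setup the frame placeholders would stand for visual tokens and the words would be tokenized text, but the ordering logic is the part this sketch is meant to show.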
