Vidi: Large Multimodal Models for Video Understanding and Editing

Abstract

Humans naturally share information with those they are connected to, and video has become one of the dominant media for communication and expression on the Internet. To support the creation of high-quality video content at scale, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos that correspond to a given text query, which plays a critical role in intelligent editing. The model can process hour-long videos and shows strong temporal understanding, e.g., retrieving the time ranges that match a given query. To support comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements: 1) Video duration: significantly longer than in existing temporal retrieval datasets; 2) Audio support: includes audio-based queries; 3) Query format: diverse query lengths and formats; 4) Annotation quality: ground-truth time ranges are manually annotated; 5) Evaluation metric: a refined IoU metric that supports evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.
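
To make the idea of an IoU computed over multiple time ranges concrete, here is a minimal sketch. It is not the refined metric defined for VUE-TR, whose exact formulation is not given in the abstract. It only illustrates one plausible reading: treat the predicted and ground-truth ranges as sets of seconds, then divide the length of their intersection by the length of their union. The function names and the `(start, end)` tuple representation are assumptions made for this illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a time range is (start_sec, end_sec).
Range = Tuple[float, float]


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Sort ranges, drop empty ones, and merge any that overlap or touch."""
    merged: List[Range] = []
    for start, end in sorted(r for r in ranges if r[1] > r[0]):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(ranges: List[Range]) -> float:
    """Total duration covered by a list of non-overlapping ranges."""
    return sum(end - start for start, end in ranges)


def intersection_length(a: List[Range], b: List[Range]) -> float:
    """Two-pointer sweep over two merged, sorted range lists."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def multi_range_iou(pred: List[Range], gt: List[Range]) -> float:
    """IoU between the union of predicted ranges and the union of
    ground-truth ranges (an illustrative assumption, not the VUE-TR metric)."""
    p, g = merge_ranges(pred), merge_ranges(gt)
    inter = intersection_length(p, g)
    union = total_length(p) + total_length(g) - inter
    return inter / union if union > 0 else 0.0
```

For example, `multi_range_iou([(0, 10), (20, 30)], [(5, 25)])` compares two predicted clips against one annotated span. The overlap is 10 seconds and the union is 30 seconds, so the score is 1/3. Merging the ranges first means that overlapping predictions are not counted twice.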
