Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Abstract

We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with a large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Because exhaustively captioning all frames is inefficient, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and show experimentally that it offers clear advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing hour-long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models on our project page.
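
As a rough illustration of what addressing the problem in the text domain could look like, the sketch below merges timestamped speech transcripts and frame captions into a single chronologically ordered text input, and parses chapter lines back into (seconds, title) pairs. It is a minimal sketch under stated assumptions, not the released implementation: the `Segment` type, the `HH:MM:SS` line layout, the `ASR`/`Caption` labels, and the function names are all hypothetical. Frame selection is not shown; the captions are assumed to belong to frames that were already selected.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped piece of text (hypothetical type for illustration)."""
    start: float   # start time in seconds
    text: str
    source: str    # assumed labels: "ASR" for speech, "Caption" for frame captions


def fmt_ts(seconds: float) -> str:
    """Format seconds as HH:MM:SS (an assumed timestamp layout)."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def build_input(asr: list[Segment], captions: list[Segment]) -> str:
    """Interleave speech and caption segments by time into one text block."""
    merged = sorted(asr + captions, key=lambda seg: seg.start)
    return "\n".join(f"{fmt_ts(seg.start)} {seg.source}: {seg.text}" for seg in merged)


def parse_chapters(output: str) -> list[tuple[int, str]]:
    """Parse lines of the assumed form 'HH:MM:SS Title' into (seconds, title)."""
    chapters = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        ts, _, title = line.partition(" ")
        h, m, s = (int(part) for part in ts.split(":"))
        chapters.append((h * 3600 + m * 60 + s, title.strip()))
    return chapters


if __name__ == "__main__":
    asr = [Segment(0.0, "welcome", "ASR"), Segment(65.0, "next topic", "ASR")]
    captions = [Segment(60.0, "a slide titled Setup", "Caption")]
    print(build_input(asr, captions))
    print(parse_chapters("00:00:00 Intro\n00:01:00 Setup"))
```

Keeping both modalities as plain timestamped lines means a single text sequence carries the whole video, which is what allows a long-context LLM to process it in one pass.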
