Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Abstract

We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. Although relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by addressing the problem efficiently in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with a large context window and feed it (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Because exhaustively captioning every frame is inefficient, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and show experimentally that it brings remarkable advantages. We train the LLM to output timestamps for the chapter boundaries as well as free-form chapter titles. This simple yet powerful approach scales to processing hour-long videos in a single forward pass. Our results demonstrate substantial improvements over the state of the art on the recent VidChapters-7M benchmark (e.g., an F1 score of 45.3 vs. 26.7). To promote further research, we release our code and models on our project page.
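As a rough illustration of the text-domain formulation described in the abstract, the sketch below interleaves timestamped speech segments and frame captions into a single chronological text input, and parses chapter lines from a model's text output. The `Segment` type, the helper names, the `ASR`/`Caption` labels, and the `HH:MM:SS title` line format are all assumptions made for this example; the abstract does not specify the actual prompt or output format, and no LLM call is shown.

```python
from dataclasses import dataclass
import re


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    source: str   # hypothetical label: "ASR" for speech, "Caption" for a frame description
    text: str


def hms(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def build_prompt(asr, captions):
    """Interleave speech and caption segments in chronological order (assumed layout)."""
    merged = sorted(asr + captions, key=lambda seg: seg.start)
    return "\n".join(f"{hms(seg.start)} {seg.source}: {seg.text}" for seg in merged)


# Assumed output format: one chapter per line, "HH:MM:SS title".
CHAPTER_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\s+(.+)$")


def parse_chapters(output: str):
    """Parse 'HH:MM:SS title' lines into (start_seconds, title) pairs, skipping other lines."""
    chapters = []
    for line in output.splitlines():
        m = CHAPTER_LINE.match(line.strip())
        if m:
            h, mi, s, title = m.groups()
            chapters.append((int(h) * 3600 + int(mi) * 60 + int(s), title.strip()))
    return chapters
```

Under these assumptions, `build_prompt` would produce the text passed to the LLM, and `parse_chapters` would turn its generated text back into chapter boundaries (in seconds) and titles.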
