Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Abstract

We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with a large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Given the inefficiency of exhaustively captioning all frames, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and experimentally demonstrate remarkable advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing hour-long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models on our project page.
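As an illustration only, the text-domain formulation described above can be sketched as follows: timestamped speech segments and frame captions are merged into one chronologically ordered text input, and a chapter list is parsed back from the model's text output. The `Segment` type, the `HH:MM:SS` line format, the `ASR`/`Caption` labels, and the helper names are all hypothetical choices made for this sketch; the abstract does not specify the actual prompt or output format, and the LLM call itself is omitted.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    source: str   # "ASR" for speech, "Caption" for a frame description (assumed labels)
    text: str


def to_hhmmss(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def build_input(segments: list[Segment]) -> str:
    """Interleave speech and caption segments by timestamp into one text block."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(f"{to_hhmmss(s.start)} {s.source}: {s.text}" for s in ordered)


def parse_chapters(output: str) -> list[tuple[int, str]]:
    """Parse lines of the assumed form 'HH:MM:SS Title' into (start_seconds, title)."""
    chapters = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        stamp, _, title = line.partition(" ")
        h, m, s = (int(part) for part in stamp.split(":"))
        chapters.append((h * 3600 + m * 60 + s, title.strip()))
    return chapters


if __name__ == "__main__":
    segments = [
        Segment(65.0, "ASR", "Now let's open the editor."),
        Segment(0.0, "ASR", "Welcome to the tutorial."),
        Segment(60.0, "Caption", "A code editor showing an empty file."),
    ]
    print(build_input(segments))
    # Stand-in for the model's text output (made up for this example).
    print(parse_chapters("00:00:00 Introduction\n00:01:00 Setting up the editor"))
```

Keeping both modalities as plain timestamped text is what lets a single long-context LLM read the whole video at once; in this sketch, a speech-guided frame selection step would simply decide which `Caption` segments are produced before `build_input` is called.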
