CoS: Chain-of-Shot Prompting for Long Video Understanding
February 10, 2025
Authors: Jian Hu, Zixu Cheng, Chenyang Si, Wei Li, Shaogang Gong
cs.AI
Abstract
Multi-modal Large Language Models (MLLMs) struggle with long videos because they require an excessive number of visual tokens. These tokens massively exceed the context length of MLLMs, so the context becomes filled with redundant, task-irrelevant shots. How to select shots is an unsolved critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adaptive to the semantic task of video understanding by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding that identifies task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, facilitating a focus on relevant context to optimise long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is available at https://lwpyh.github.io/CoS.
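To make the two-stage idea in the abstract concrete, here is a minimal, hypothetical sketch of the shot-selection pipeline: a binary coding marks each shot as task-relevant or not, and relevant shots are then paired with irrelevant ones. All function names, the threshold value, and the per-shot relevance scores are illustrative assumptions for exposition, not the authors' implementation.

# Minimal sketch of CoS-style shot selection (assumed names, not the official code).

from typing import List, Tuple

def binary_video_summary(shot_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Pseudo temporal grounding: map per-shot task-relevance scores to a
    binary code, 1 for task-relevant shots and 0 for irrelevant ones."""
    return [1 if score >= threshold else 0 for score in shot_scores]

def co_reasoning_pairs(binary_code: List[int]) -> List[Tuple[int, int]]:
    """Pair task-relevant positive shots with irrelevant negative shots,
    yielding (positive_index, negative_index) pairs for alignment."""
    positives = [i for i, bit in enumerate(binary_code) if bit == 1]
    negatives = [i for i, bit in enumerate(binary_code) if bit == 0]
    return list(zip(positives, negatives))

# Example: hypothetical relevance scores from some task-conditioned scorer.
scores = [0.9, 0.2, 0.7, 0.1, 0.8]
code = binary_video_summary(scores)   # [1, 0, 1, 0, 1]
pairs = co_reasoning_pairs(code)      # [(0, 1), (2, 3)]
print(code, pairs)

In the sketch, the selected positive shots would then be re-embedded alongside the original video as the optimised visual prompt; how the relevance scores and the pairing loss are computed is specific to the paper and not reproduced here.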