CoS: Chain-of-Shot Prompting for Long Video Understanding

Abstract

Multi-modal Large Language Models (MLLMs) struggle with long videos because they require an excessive number of visual tokens. These tokens far exceed the context length of MLLMs, so the context is filled with redundant, task-irrelevant shots. How to select shots remains an unsolved, critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to misunderstanding of the video. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adapted to the semantic task of video understanding by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding that identifies task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. CoS embeds the optimised shot selections into the original video, facilitating a focus on relevant context to optimise long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is available at https://lwpyh.github.io/CoS.
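
As a rough illustration of the binary coding idea in part (1) and the positive/negative split used in part (2), the toy sketch below thresholds per-shot relevance scores into a 0/1 code and partitions the shots accordingly. The function names, the example scores, and the fixed threshold are all assumptions made for illustration; the abstract does not specify how relevance is computed, and this is not the authors' implementation.

```python
from typing import List, Tuple


def binary_coding(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return 1 for shots treated as task-relevant, 0 otherwise.

    `scores` and `threshold` are hypothetical: in practice the relevance
    signal would come from a model, not from hand-written numbers.
    """
    return [1 if s >= threshold else 0 for s in scores]


def split_shots(shots: List[str], coding: List[int]) -> Tuple[List[str], List[str]]:
    """Partition shots into positive (code 1) and negative (code 0) sets."""
    if len(shots) != len(coding):
        raise ValueError("shots and coding must have the same length")
    positive = [s for s, bit in zip(shots, coding) if bit == 1]
    negative = [s for s, bit in zip(shots, coding) if bit == 0]
    return positive, negative


# Toy data: four shots with made-up relevance scores for some query.
shots = ["shot_0", "shot_1", "shot_2", "shot_3"]
scores = [0.1, 0.8, 0.4, 0.9]

coding = binary_coding(scores)
positive, negative = split_shots(shots, coding)
```

The positive and negative sets would then be passed, alongside the original video, to whatever reasoning step consumes them; that step is outside the scope of this sketch.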
