An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes

Abstract

Large Multimodal Models (LMMs) perceive video frames uniformly, which is computationally inefficient for videos whose temporal information density varies. This paper presents Quicksviewer, an LMM with a new perception paradigm that partitions a video of nonuniform density into cubes of varying length using Gumbel Softmax, then applies a unified resampling to each cube to achieve efficient video understanding. This simple and intuitive approach compresses video dynamically and online according to its temporal density, significantly reducing spatiotemporal redundancy (an overall 45× compression rate) while enabling efficient training with a large receptive field. We train the model from a language backbone through three progressive stages, each incorporating long videos averaging 420 s at 1 fps thanks to the perception efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline that employs a fixed partitioning strategy by up to 8.72 in accuracy, demonstrating its effectiveness. On Video-MME, Quicksviewer achieves state-of-the-art (SOTA) results under modest sequence lengths while using at most 5% of the tokens per frame required by baselines. With this paradigm, scaling up the number of input frames reveals a clear power law in the model's capabilities. We also verify empirically that the segments generated by the cubing network can help analyze continuous events in videos.
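The partition-then-resample idea described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The two-way boundary logits per frame, the function names (`gumbel_softmax`, `partition_into_cubes`, `resample_cube`, `compress_video`), and the use of average pooling in place of a learned resampler are all assumptions made for illustration only.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Sample a relaxed one-hot vector along the last axis."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def partition_into_cubes(boundary_logits, tau=1.0, rng=None):
    """boundary_logits: (T, 2), columns = [continue cube, start new cube].
    Returns one cube index per frame (hypothetical layout)."""
    probs = gumbel_softmax(boundary_logits, tau, rng)
    starts = probs.argmax(axis=-1)             # hard decision per frame
    starts[0] = 0                              # first frame opens cube 0
    return np.cumsum(starts)

def resample_cube(cube_tokens, num_out):
    """Average-pool (n_frames, n_tokens, dim) down to (num_out, dim).
    Assumption: stands in for a learned resampler."""
    flat = cube_tokens.reshape(-1, cube_tokens.shape[-1])
    chunks = np.array_split(flat, num_out)
    return np.stack([c.mean(axis=0) for c in chunks])

def compress_video(tokens, boundary_logits, num_out=4, tau=1.0, rng=None):
    """Every cube yields the same number of tokens, whatever its length."""
    cube_ids = partition_into_cubes(boundary_logits, tau, rng)
    cubes = [tokens[cube_ids == k] for k in range(cube_ids.max() + 1)]
    return cube_ids, [resample_cube(c, num_out) for c in cubes]

# Toy input: 12 frames, 16 visual tokens per frame, 8-dim features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 16, 8))
logits = rng.normal(size=(12, 2))
cube_ids, compressed = compress_video(tokens, logits, rng=rng)
```

Because each cube is reduced to a fixed token budget, long low-density stretches of video cost no more tokens than short high-density ones, which is the intuition behind density-based compression in this sketch.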
