EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

Abstract

Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against the computational cost of downstream generation. Traditional video tokenizers apply a uniform token assignment across the temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce EVATok, a framework for producing Efficient Video Adaptive Tokenizers. Our framework estimates the optimal token assignment for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by the routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe, which integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.
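
The quality-cost trade-off behind per-block token assignment can be illustrated with a toy sketch. This is not the procedure used by EVATok: the scoring rule (reconstruction error plus a weighted token count), the function name pick_assignment, the candidate lengths, the error values, and the cost_weight parameter are all assumptions made for illustration only.

```python
from typing import Dict, List


def pick_assignment(errors: List[Dict[int, float]], cost_weight: float) -> List[int]:
    """Pick a token count for each temporal block.

    errors[i] maps a candidate token count to a (hypothetical) reconstruction
    error for block i. The score is error + cost_weight * tokens, which is an
    assumed stand-in for a quality-cost trade-off, not the paper's criterion.
    """
    assignment = []
    for block_errors in errors:
        # Iterate candidates in ascending order so ties go to the shorter length.
        best = min(sorted(block_errors),
                   key=lambda n: block_errors[n] + cost_weight * n)
        assignment.append(best)
    return assignment


# Made-up error values: block 0 is nearly static, block 1 is highly dynamic.
errors = [
    {64: 0.10, 128: 0.09, 256: 0.085},
    {64: 0.60, 128: 0.30, 256: 0.12},
]

assignment = pick_assignment(errors, cost_weight=0.001)
print(assignment)       # per-block token counts
print(sum(assignment))  # total tokens spent on this clip
```

With these made-up numbers, the static block gains little from extra tokens and receives the shortest length, while the dynamic block receives the longest one. A uniform assignment would have to spend the same number of tokens on both.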
