EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

Abstract

Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce EVATok, a framework to produce Efficient Video Adaptive Tokenizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.
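
The quality-cost trade-off behind per-block token assignment can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the token options, the `toy_quality` proxy, the `cost_weight` penalty, and the exhaustive search are hypothetical stand-ins, and the resulting savings figure is unrelated to the 24.4% reported above.

```python
from itertools import product

# Hypothetical per-block token budgets; the paper's actual options are not stated here.
TOKEN_OPTIONS = (16, 32, 64)


def toy_quality(assignment, complexity):
    """Toy stand-in for reconstruction quality: a block's quality saturates
    at 1.0 once its token budget reaches that block's (invented) complexity."""
    return sum(min(1.0, tokens / c) for tokens, c in zip(assignment, complexity))


def best_assignment(complexity, cost_weight):
    """Score every possible assignment as quality minus weighted token cost
    and return the highest-scoring one (exhaustive search, toy scale only)."""
    def score(assignment):
        return toy_quality(assignment, complexity) - cost_weight * sum(assignment)

    return max(product(TOKEN_OPTIONS, repeat=len(complexity)), key=score)


# Four temporal blocks: two near-static, two dynamic (values are invented).
complexity = [16, 16, 64, 64]
adaptive = best_assignment(complexity, cost_weight=0.005)

# Uniform baseline: every block receives the largest budget.
uniform = (max(TOKEN_OPTIONS),) * len(complexity)
savings = 1 - sum(adaptive) / sum(uniform)

print("adaptive assignment:", adaptive)
print("token savings vs. uniform:", f"{savings:.1%}")
```

In this toy setting the static blocks receive the smallest budget and the dynamic blocks the largest, which is the behavior the abstract attributes to adaptive assignment. A learned router, as described above, would predict such assignments directly instead of searching over them.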
