Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

Abstract

Pre-trained vision foundation models, traditionally used for visual understanding, provide powerful representations. We explore a novel and largely underexplored direction: building an image tokenizer directly on top of such models. Specifically, we employ a frozen vision foundation model as the tokenizer's encoder. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality while also improving token efficiency. It further boosts autoregressive (AR) generation, achieving a gFID of 2.07 on ImageNet benchmarks, accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without classifier-free guidance (CFG). The code will be released publicly to benefit the community.
