GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Abstract

In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality, a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of the latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
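
To make the idea of semantic regularization more concrete, the following is a minimal sketch of one way such an alignment term could be written. The abstract does not specify the form of the loss, so the cosine-similarity objective, the function names (`semantic_alignment_loss`, `total_loss`), and the `weight` parameter are all assumptions chosen for illustration, not the authors' implementation. Features are plain Python lists so the example stays self-contained.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def semantic_alignment_loss(tokenizer_feats, encoder_feats):
    """Mean (1 - cosine similarity) over paired feature vectors.

    ASSUMPTION: cosine distance is used as the alignment measure; the
    abstract only says tokenizer features are aligned with features from
    a pre-trained visual encoder. Both inputs are lists of vectors with
    matching shapes (hypothetical: any projection to a shared dimension
    has already been applied).
    """
    if len(tokenizer_feats) != len(encoder_feats):
        raise ValueError("feature lists must have the same length")
    losses = [
        1.0 - cosine_similarity(t, e)
        for t, e in zip(tokenizer_feats, encoder_feats)
    ]
    return sum(losses) / len(losses)


def total_loss(recon_loss, tokenizer_feats, encoder_feats, weight=0.5):
    """Reconstruction loss plus a weighted alignment term.

    ASSUMPTION: the regularizer is added to the training objective with a
    scalar weight; the value 0.5 is arbitrary.
    """
    return recon_loss + weight * semantic_alignment_loss(
        tokenizer_feats, encoder_feats
    )
```

In this sketch, tokenizer features that point in the same direction as the frozen encoder's features contribute almost nothing to the loss, while features that drift away are penalized. That is one plausible way to constrain latent space complexity during scaling.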

