InsightTok: Enhancing Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Abstract

Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.
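To make the compression budget in the abstract concrete, the following minimal sketch works out how many discrete tokens, and how many bits, a tokenizer with these settings would spend on one image. It is illustrative arithmetic only, not the authors' code. It assumes that "16k" means 16,384 codebook entries, that the 16x rate applies to each spatial dimension, and that the input is an 8-bit RGB image; the function name `token_budget` and the 256x256 resolution used below are hypothetical.

```python
import math


def token_budget(height, width, downsample=16, codebook_size=16_384):
    """Estimate the discrete-token budget for one RGB image.

    Assumptions (not stated in the source): "16k" = 16,384 entries,
    the downsampling factor applies per spatial dimension, and the
    input uses 8 bits per channel.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")

    # Each downsample x downsample pixel patch becomes one token.
    grid_h, grid_w = height // downsample, width // downsample
    num_tokens = grid_h * grid_w

    # A token is an index into the codebook, so it carries log2(K) bits.
    bits_per_token = math.log2(codebook_size)
    total_bits = num_tokens * bits_per_token

    # Uncompressed size: 3 channels at 8 bits each.
    raw_bits = height * width * 3 * 8

    return {
        "grid": (grid_h, grid_w),
        "num_tokens": num_tokens,
        "bits_per_token": bits_per_token,
        "total_bits": total_bits,
        "compression_ratio": raw_bits / total_bits,
    }


if __name__ == "__main__":
    for key, value in token_budget(256, 256).items():
        print(f"{key}: {value}")
```

Under these assumptions, a 256x256 image becomes a 16x16 grid of 256 tokens at 14 bits each, about 448 bytes in total. Every 16x16 pixel patch, which may contain an entire small glyph or an eye, must be summarized by a single codebook index, which is one way to see why fine text and facial detail are the first things lost.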