Tokenize Image as a Set

Abstract

This paper proposes a fundamentally new paradigm for image generation through set-based tokenization and distribution modeling. Unlike conventional methods that serialize images into fixed-position latent codes with a uniform compression ratio, we introduce an unordered token set representation that dynamically allocates coding capacity based on regional semantic complexity. This representation, TokenSet, enhances global context aggregation and improves robustness against local perturbations. To address the critical challenge of modeling discrete sets, we devise a dual transformation mechanism that bijectively converts sets into fixed-length integer sequences with summation constraints. Further, we propose Fixed-Sum Discrete Diffusion, the first framework to simultaneously handle discrete values, fixed sequence length, and summation invariance, which enables effective set distribution modeling. Experiments demonstrate our method's superiority in semantic-aware representation and generation quality. Our innovations, spanning novel representation and modeling strategies, advance visual generation beyond traditional sequential token paradigms. Our code and models are publicly available at https://github.com/Gengzigang/TokenSet.
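The dual transformation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the fixed-length integer sequence is a per-codebook-entry count vector, so that its length equals the codebook size and its entries sum to the number of tokens in the set. The function names, the codebook size of 8, and the example token ids are hypothetical and are not taken from the released code.

```python
def set_to_counts(tokens, codebook_size):
    """Map an unordered multiset of token ids to a count vector.

    Assumption: entry i of the result is how many times codebook
    index i occurs, so the vector has length `codebook_size` and
    sums to len(tokens).
    """
    counts = [0] * codebook_size
    for t in tokens:
        if not 0 <= t < codebook_size:
            raise ValueError(f"token id {t} is outside the codebook")
        counts[t] += 1
    return counts


def counts_to_set(counts):
    """Inverse map: expand a count vector into a canonical (sorted) token list."""
    tokens = []
    for index, count in enumerate(counts):
        tokens.extend([index] * count)
    return tokens


# Hypothetical example: 6 tokens drawn from a codebook of 8 entries.
tokens = [5, 2, 5, 0, 2, 5]
counts = set_to_counts(tokens, codebook_size=8)

print(counts)                 # fixed length: always 8 entries
print(sum(counts))            # fixed sum: always the number of tokens
print(counts_to_set(counts))  # same multiset, order discarded
```

Under this reading, any reordering of the tokens yields the same count vector, which is what makes the representation order-free, and a generative model over such vectors has to keep the sum constant, which is the constraint that Fixed-Sum Discrete Diffusion is described as handling.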

