Next Visual Granularity Generation

Abstract

We propose a novel approach to image generation that decomposes an image into a structured sequence. Every element of the sequence shares the same spatial resolution but uses a different number of unique tokens, so each element captures a different level of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which starts from an empty image and generates the visual granularity sequence step by step, refining the image in a structured manner from global layout to fine details. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over generation across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently achieves better FID scores (3.30 -> 3.03, 2.57 -> 2.44, 2.09 -> 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.
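The idea of a sequence whose elements share one spatial resolution but differ in the number of unique tokens can be illustrated with a toy sketch. This is not the NVG tokenizer, which the abstract does not specify. It assumes scalar values in [0, 1) and uses plain uniform binning as a stand-in, and the names `quantize`, `granularity_sequence`, and `count_unique` are hypothetical.

```python
# Toy illustration only: uniform scalar binning stands in for a learned tokenizer.

def quantize(grid, levels):
    """Map every value in [0, 1) to one of `levels` integer token ids."""
    return [[min(int(v * levels), levels - 1) for v in row] for row in grid]


def granularity_sequence(grid, unique_counts=(1, 2, 4, 8)):
    """Return one token map per stage, coarse to fine.

    Every map keeps the height and width of `grid`; only the number of
    distinct token ids that may appear changes from stage to stage.
    """
    return [quantize(grid, n) for n in unique_counts]


def count_unique(token_map):
    """Count the distinct token ids used in a token map."""
    return len({t for row in token_map for t in row})


# A 4x4 "image" whose 16 values are evenly spread over [0, 1).
toy_grid = [[(4 * r + c) / 16 for c in range(4)] for r in range(4)]

for stage in granularity_sequence(toy_grid):
    print(len(stage), "x", len(stage[0]), "map,", count_unique(stage), "unique tokens")
```

Each stage stays 4 x 4, while the set of token ids grows, so early stages can only express coarse structure and later stages add detail.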
