PixVerve: Advancing Native 100MP Ultra-High-Resolution Image Generation with a Large-Scale, High-Quality Dataset

Abstract

Text-to-Image (T2I) models have recently made notable progress at resolutions of around 1K and 2K. Driven by the demand for a better visual experience and by rapid advances in imaging technology, interest in Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation remains highly challenging because high-resolution content is both scarce and complex. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline. It contains 95K images across diverse scenarios, each with at least 100M pixels, together with seven-dimensional annotations. Building on this large-scale image-text dataset, we take a pioneering step toward extending various T2I foundation models to native 100MP generation using three training schemes. Finally, we propose PixVerve-Bench, a benchmark that combines conventional metrics with multimodal large language model-based assessments to establish a comprehensive evaluation protocol for UHR images, covering both visual quality and semantic alignment. Extensive experimental results on our benchmark, together with a constructive exploration of training strategies, provide valuable insights for future breakthroughs.
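To make the 100MP requirement stated in the abstract concrete, the following is a minimal sketch of a pixel-count check. It assumes that "100M" means 100,000,000 pixels and that the threshold applies to width times height. The names `MIN_PIXELS`, `pixel_count`, and `meets_uhr_threshold` are hypothetical and are not taken from the PixVerve data pipeline, whose actual filtering steps are not described here.

```python
# Illustrative only: a pixel-count check for the 100MP threshold.
# Assumption: "100M" is read as 100,000,000 pixels (width * height).
MIN_PIXELS = 100_000_000


def pixel_count(width: int, height: int) -> int:
    """Return the total number of pixels for the given image dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return width * height


def meets_uhr_threshold(width: int, height: int, min_pixels: int = MIN_PIXELS) -> bool:
    """Return True if the image has at least `min_pixels` pixels."""
    return pixel_count(width, height) >= min_pixels


if __name__ == "__main__":
    # Hypothetical example sizes chosen for comparison.
    examples = {
        "2K square": (2048, 2048),
        "4K UHD": (3840, 2160),
        "100MP square": (10000, 10000),
    }
    for name, (w, h) in examples.items():
        megapixels = pixel_count(w, h) / 1e6
        print(f"{name}: {w}x{h} = {megapixels:.1f} MP, qualifies: {meets_uhr_threshold(w, h)}")
```

Under these assumptions, a 2048x2048 image has about 4.2 million pixels, so a 100MP image carries roughly 24 times as many pixels as a typical 2K output.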