PixVerve: Advancing Native Ultra-High-Resolution Image Generation to 100 Megapixels with a Large-Scale, High-Quality Dataset

Abstract

Text-to-Image (T2I) models have recently made notable progress at resolutions of roughly 1K to 2K. As the pursuit of better visual experiences intensifies and imaging technology develops rapidly, demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation remains highly challenging because high-resolution content is both scarce and complex. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset built with a carefully designed data pipeline. It contains 95K images spanning diverse scenarios, each with at least 100 million pixels (100MP), together with seven-dimensional annotations. Building on this large-scale image-text dataset, we take a pioneering step toward extending various T2I foundation models to native 100MP generation using three training schemes. Finally, by combining conventional metrics with multimodal large language model (MLLM)-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images that covers both visual quality and semantic alignment. Extensive experiments on this benchmark, together with our exploration of training strategies, provide valuable insights for future advances.