UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
November 22, 2025
Authors: Tian Ye, Song Fei, Lei Zhu
cs.AI
Abstract
Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
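To make component (iii) more concrete, the following is a minimal sketch of what a Huber loss computed over wavelet subbands and scaled by an SNR-dependent weight could look like. It is not the paper's formulation, which the abstract does not specify. The one-level Haar transform, the Min-SNR-style clipping weight, and every name and default here (`haar_dwt2`, `snr_weight`, `wavelet_huber_loss`, `delta`, `gamma`, `band_weights`) are assumptions chosen purely for illustration.

```python
import numpy as np


def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform of an (H, W) array (H, W even).

    Returns four subbands (LL, LH, HL, HH), each of shape (H/2, W/2).
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # low-pass in both directions
    lh = (a - b + c - d) / 2.0  # difference between adjacent columns
    hl = (a + b - c - d) / 2.0  # difference between adjacent rows
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh


def huber(r, delta):
    """Elementwise Huber penalty: quadratic near zero, linear in the tails."""
    abs_r = np.abs(r)
    return np.where(abs_r <= delta, 0.5 * r**2, delta * (abs_r - 0.5 * delta))


def snr_weight(snr, gamma=5.0):
    """Min-SNR-style clipping weight (assumed here, not taken from the paper).

    Equals 1 for noisy timesteps (snr <= gamma) and decays as gamma / snr
    for nearly clean timesteps.
    """
    return min(snr, gamma) / snr


def wavelet_huber_loss(pred, target, snr, delta=1.0,
                       band_weights=(1.0, 1.0, 1.0, 1.0), gamma=5.0):
    """Huber loss summed over Haar subbands, scaled by an SNR-dependent weight.

    band_weights is a hypothetical knob for re-weighting frequency bands.
    """
    total = 0.0
    for w, p, t in zip(band_weights, haar_dwt2(pred), haar_dwt2(target)):
        total += w * huber(p - t, delta).mean()
    return float(snr_weight(snr, gamma) * total)
```

With these assumed defaults, raising a high-frequency entry of `band_weights` puts more gradient on fine detail, while the SNR weight damps the contribution of nearly clean timesteps. Both effects only illustrate the kind of rebalancing across timesteps and frequency bands that the abstract describes.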