Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

May 20, 2026
Authors: Dong Chen, Fangyun Wei, Ziyu Wan, Dongdong Chen, Jiawei Zhang, Jinjing Zhao, Sirui Zhang, Yang Yue, Zhiyang Liang, Baining Guo, Chong Luo, Jianmin Bao, Ji Li, Lei Shi, Qinhong Yang, Xiuyu Wu, Xuelu Feng, Yan Lu, Yanchen Dong, Yitong Wang, Yunuo Chen
cs.AI

Abstract

We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply RL with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.
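
To make the quoted aspect-ratio range and latency figures concrete, here is a minimal Python sketch. It is not taken from the paper: the helper name `dims_for` and the snapping of each side to a multiple of 16 pixels are assumptions for illustration only, and the actual size constraints depend on the model's VAE and patch sizes, which the abstract does not state. Only the 1:2 to 2:1 range, the 1024^2 and 1440^2 pixel budgets, and the 3.15 s and 0.84 s timings come from the abstract.

```python
import math


def dims_for(aspect_ratio: float, side: int = 1024, multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) with about side*side pixels and width/height near aspect_ratio.

    Hypothetical helper: snapping to `multiple` is an assumption, not a detail from the paper.
    """
    if not 0.5 <= aspect_ratio <= 2.0:
        raise ValueError("aspect ratio is outside the 1:2 to 2:1 range stated in the abstract")
    area = side * side
    # Solve width * height = area and width / height = aspect_ratio.
    width = math.sqrt(area * aspect_ratio)
    height = math.sqrt(area / aspect_ratio)

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


# Ratio of the two latencies quoted in the abstract (full model vs. 4-step turbo).
turbo_speedup = 3.15 / 0.84

if __name__ == "__main__":
    for ratio in (0.5, 1.0, 2.0):
        print(ratio, dims_for(ratio))
    print("max square:", dims_for(1.0, side=1440))
    print(f"turbo speedup: {turbo_speedup:.2f}x")
```

Under these assumptions, a 2:1 request at the 1024^2 pixel budget maps to a 1456x720 canvas, and the two quoted latencies imply that the distilled turbo version is about 3.75 times faster per image than the full model on the same GPU.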