Lens: Rethinking the Training Efficiency of Foundation Text-to-Image Models

Abstract

We introduce Lens, a 3.8B-parameter text-to-image (T2I) model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize the information density of each training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply three further components: reinforcement learning (RL) with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality; a reasoner module with training-free system prompt search to better align user requests with the model; and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.
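The abstract does not describe how batches with multiple resolutions and aspect ratios are assembled. As a rough illustration only, the sketch below shows aspect-ratio bucketing, a common way to do this; it is not claimed to be the method used for Lens. The function names (`make_buckets`, `assign_bucket`, `group_by_bucket`), the 64-pixel side granularity, and the fixed pixel budget per bucket are all assumptions made for this example. Only the 1:2 to 2:1 ratio range and the 1024^2 resolution come from the abstract.

```python
import math
from collections import defaultdict


def make_buckets(target_area=1024 * 1024, min_ratio=0.5, max_ratio=2.0, step=64):
    """Enumerate (width, height) buckets of roughly target_area pixels.

    Hypothetical helper: sides are multiples of `step`, and the width/height
    ratio is kept within [min_ratio, max_ratio] (1:2 to 2:1 by default).
    """
    buckets = set()
    width = step
    while True:
        # Height (a multiple of step) that brings width * height closest to target_area.
        height = max(step, round(target_area / width / step) * step)
        ratio = width / height
        if ratio > max_ratio:
            break  # ratio only grows with width, so no later width can qualify
        if ratio >= min_ratio:
            buckets.add((width, height))
        width += step
    return sorted(buckets)


def assign_bucket(width, height, buckets):
    """Return the bucket whose aspect ratio is closest to the image's (in log space)."""
    log_ratio = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - log_ratio))


def group_by_bucket(image_sizes, buckets):
    """Map each bucket to the indices of the images assigned to it."""
    groups = defaultdict(list)
    for index, (width, height) in enumerate(image_sizes):
        groups[assign_bucket(width, height, buckets)].append(index)
    return dict(groups)


buckets = make_buckets()
sizes = [(1920, 1080), (500, 500), (1280, 720), (1080, 1920)]
groups = group_by_bucket(sizes, buckets)
for bucket, indices in sorted(groups.items()):
    print(bucket, indices)
```

In a setup like this, images in the same bucket are resized to a shared shape so they can be stacked into one tensor, and a single optimization step can then accumulate micro-batches drawn from several buckets so that it covers multiple resolutions and aspect ratios. Repeating `make_buckets` with other `target_area` values would add further resolution levels.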