Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Abstract

We introduce Lens, a 3.8B-parameter text-to-image (T2I) model that matches, and in several cases surpasses, state-of-the-art models with more than 6B parameters across various benchmarks while requiring significantly less training compute. For example, Lens uses only about 19.3% of the training compute used by Z-Image. Beyond its compact model size, this training efficiency stems from two key strategies. First, we maximize the information density of each training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and average approximately 109 words, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices: a semantic VAE that provides better latent representations, and a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply three further stages: reinforcement learning (RL) with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.
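One common way to build batches from images with several resolutions and aspect ratios is aspect-ratio bucketing. The abstract does not say how Lens does this, so the following is only a minimal sketch of that general technique, not the authors' method. The function names (`make_buckets`, `assign_bucket`, `group_by_bucket`), the 64-pixel step, and the 1024-pixel target side are all assumptions chosen for illustration; only the 1:2 to 2:1 aspect-ratio range comes from the text.

```python
import math
from collections import defaultdict


def make_buckets(target_side=1024, step=64, min_ratio=0.5, max_ratio=2.0):
    """Enumerate (width, height) buckets whose area is close to target_side**2.

    Hypothetical helper: `step` and `target_side` are illustrative assumptions.
    """
    target_area = target_side * target_side
    buckets = []
    width = step
    while width <= 2 * target_side:
        # Choose the height (a multiple of `step`) that keeps the area near the target.
        height = max(step, round(target_area / width / step) * step)
        if min_ratio <= width / height <= max_ratio:
            buckets.append((width, height))
        width += step
    return buckets


def assign_bucket(width, height, buckets):
    """Return the bucket whose aspect ratio is closest in log space."""
    log_ratio = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - log_ratio))


def group_by_bucket(image_sizes, buckets):
    """Map each bucket to the indices of the images assigned to it."""
    groups = defaultdict(list)
    for index, (width, height) in enumerate(image_sizes):
        groups[assign_bucket(width, height, buckets)].append(index)
    return dict(groups)


if __name__ == "__main__":
    buckets = make_buckets()
    sizes = [(2000, 2000), (4000, 2000), (512, 512), (900, 1600)]
    for bucket, indices in group_by_bucket(sizes, buckets).items():
        print(bucket, indices)
```

Images in the same bucket can be resized to a common shape and stacked into one tensor. How samples from different buckets are then combined within a single optimization step is not specified in the abstract and is left open here.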