Lens: Rethinking Training Efficiency for Foundation Text-to-Image Models

Abstract

We introduce Lens, a 3.8B-parameter text-to-image (T2I) model that is competitive with, and in several cases surpasses, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. Beyond its compact model size, the training efficiency of Lens stems from two key strategies. First, we maximize the information density of the data in each training batch by (i) training on Lens-800M, a dataset of 800M image-text pairs with dense captions generated by GPT-4.1 that average approximately 109 words, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including a semantic VAE that provides better latent representations and a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply reinforcement learning (RL) with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440², and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024² image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.
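
As an illustration of batches that mix resolutions and aspect ratios, the following is a minimal sketch, not the authors' pipeline. The abstract does not say how mixed batches are assembled, so everything here is an assumption made for the example: the function names `bucket_shapes` and `sample_mixed_batch`, the 32-pixel snapping step, the 512 base resolution, and uniform sampling over shapes. Only the 1:2 to 2:1 ratio range and the 1024 and 1440 resolutions come from the abstract.

```python
import random


def bucket_shapes(base, min_ratio=0.5, max_ratio=2.0, step=32):
    """Return (width, height) pairs with roughly base*base pixels.

    Widths and heights are multiples of `step` (an assumed value), and
    width/height stays within [min_ratio, max_ratio], i.e. 1:2 to 2:1.
    """
    target_area = base * base
    shapes = []
    for width in range(step, 2 * base + step, step):
        # Snap the height to the grid while keeping the area close to target.
        height = round(target_area / width / step) * step
        if height == 0:
            continue
        if min_ratio <= width / height <= max_ratio:
            shapes.append((width, height))
    return shapes


def sample_mixed_batch(bases, batch_size, rng):
    """Draw one batch whose items may differ in resolution and aspect ratio."""
    pool = [shape for base in bases for shape in bucket_shapes(base)]
    # Uniform sampling is an assumption; a real pipeline would follow the data.
    return [rng.choice(pool) for _ in range(batch_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    for width, height in sample_mixed_batch([512, 1024, 1440], 8, rng):
        print(f"{width}x{height}  ratio={width / height:.2f}")
```

In an actual training loop, items of different shapes cannot be stacked into one dense tensor, so they would need padding or token-sequence packing. The abstract does not specify which approach Lens uses.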