Generating an Image from 1,000 Words: Enhancing Text-to-Image with Structured Captions

Abstract

Text-to-image models have rapidly evolved from casual creative tools into professional-grade systems, achieving unprecedented levels of image quality and realism. Yet most models are trained to map short prompts to detailed images, creating a gap between sparse textual input and rich visual output. This mismatch reduces controllability: models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at https://huggingface.co/briaai/FIBO.
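
The captioning-generation loop behind TaBR can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not specify the captioner, the generator, or the similarity measure, so everything below is a hypothetical stand-in. Images are modeled as plain attribute dictionaries, `tabr_score`, `attribute_overlap`, `full_captioner`, `short_captioner`, and `toy_generator` are invented names, and the similarity is a simple attribute match rate. The sketch only shows the shape of the protocol: caption a real image, regenerate from the caption alone, and compare the result to the original.

```python
from typing import Callable, Dict, Iterable

# Toy stand-ins (assumption): an "image" is a dict of visual attributes,
# and a caption is the subset of attributes that the text carries.
Image = Dict[str, str]
Caption = Dict[str, str]


def tabr_score(
    images: Iterable[Image],
    captioner: Callable[[Image], Caption],
    generator: Callable[[Caption], Image],
    similarity: Callable[[Image, Image], float],
) -> float:
    """Average similarity between each real image and its reconstruction.

    The caption is the only information passed from the real image to the
    generator, so the text acts as the bottleneck.
    """
    scores = []
    for image in images:
        caption = captioner(image)           # image -> text
        reconstruction = generator(caption)  # text -> image
        scores.append(similarity(image, reconstruction))
    if not scores:
        raise ValueError("tabr_score needs at least one image")
    return sum(scores) / len(scores)


def attribute_overlap(a: Image, b: Image) -> float:
    """Hypothetical similarity: fraction of attributes with equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    return sum(1 for k in keys if a.get(k) == b.get(k)) / len(keys)


def full_captioner(image: Image) -> Caption:
    """Long structured caption: every attribute is described."""
    return dict(image)


def short_captioner(image: Image) -> Caption:
    """Short prompt: only the subject is described."""
    return {"subject": image["subject"]}


def toy_generator(caption: Caption) -> Image:
    """Fills any attribute the caption omits with an arbitrary default."""
    image = {"subject": "unknown", "lighting": "default", "camera": "default"}
    image.update(caption)
    return image


real_images = [
    {"subject": "red bicycle", "lighting": "golden hour", "camera": "low angle"},
    {"subject": "glass teapot", "lighting": "soft studio", "camera": "top down"},
]

full_score = tabr_score(real_images, full_captioner, toy_generator, attribute_overlap)
short_score = tabr_score(real_images, short_captioner, toy_generator, attribute_overlap)
print(f"structured caption: {full_score:.2f}, short prompt: {short_score:.2f}")
```

In this toy setting the structured caption reconstructs every attribute, while the short prompt leaves the generator to fill in lighting and camera arbitrarily, which lowers the score. A real evaluation would presumably replace the stand-ins with an actual captioning model, the text-to-image model under test, and an image-level similarity measure.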