Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions
November 10, 2025
Authors: Eyal Gutflaish, Eliran Kachlon, Hezi Zisman, Tal Hacham, Nimrod Sarid, Alexander Visheratin, Saar Huberman, Gal Davidi, Guy Bukchin, Kfir Goldberg, Ron Mokady
cs.AI
Abstract
Text-to-image models have rapidly evolved from casual creative tools to
professional-grade systems, achieving unprecedented levels of image quality and
realism. Yet, most models are trained to map short prompts into detailed
images, creating a gap between sparse textual input and rich visual outputs.
This mismatch reduces controllability, as models often fill in missing details
arbitrarily, biasing toward average user preferences and limiting precision for
professional use. We address this limitation by training the first open-source
text-to-image model on long structured captions, where every training sample is
annotated with the same set of fine-grained attributes. This design maximizes
expressive coverage and enables disentangled control over visual factors. To
process long captions efficiently, we propose DimFusion, a fusion mechanism
that integrates intermediate tokens from a lightweight LLM without increasing
token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR)
evaluation protocol. By assessing how well real images can be reconstructed
through a captioning-generation loop, TaBR directly measures controllability
and expressiveness, even for very long captions where existing evaluation
methods fail. Finally, we demonstrate our contributions by training the
large-scale model FIBO, achieving state-of-the-art prompt alignment among
open-source models. Model weights are publicly available at
https://huggingface.co/briaai/FIBO
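
The abstract describes training on long structured captions in which every sample is annotated with the same set of fine-grained attributes. As a rough illustration of what such a caption could look like, here is a minimal sketch; the field names and values are hypothetical and are not taken from the actual FIBO schema.

```python
# Hypothetical long structured caption: every training sample shares the same
# set of fine-grained attribute fields. Field names and values below are
# illustrative assumptions, not the schema used by FIBO.
structured_caption = {
    "subject": "an elderly fisherman mending a net on a wooden pier",
    "style": "documentary photograph",
    "lighting": "overcast, soft diffuse light",
    "camera": {"angle": "eye level", "focal_length": "85mm", "depth_of_field": "shallow"},
    "color_palette": ["muted blue", "weathered brown", "grey"],
    "composition": "subject off-center, rule of thirds, horizon in upper third",
    "background": "fishing boats moored in a calm harbor",
    "mood": "quiet, contemplative",
}
```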
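DimFusion is described as integrating intermediate tokens from a lightweight LLM without increasing token length. The sketch below shows one way such a fusion could be wired in PyTorch, by projecting hidden states tapped from several LLM layers and summing them per token position; the layer selection, projection, and summation are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DimFusionSketch(nn.Module):
    """Minimal sketch of the idea stated in the abstract: fuse intermediate
    tokens from a lightweight LLM into a single conditioning sequence without
    increasing token length. The fusion-by-projection-and-sum below is an
    assumption, not the paper's exact design."""

    def __init__(self, llm_dim: int, cond_dim: int, num_tapped_layers: int):
        super().__init__()
        # One projection per tapped intermediate layer of the LLM.
        self.projections = nn.ModuleList(
            [nn.Linear(llm_dim, cond_dim) for _ in range(num_tapped_layers)]
        )

    def forward(self, intermediate_states: list[torch.Tensor]) -> torch.Tensor:
        # intermediate_states: hidden states of shape [batch, seq_len, llm_dim]
        # taken from different LLM layers for the same caption tokens.
        fused = sum(proj(h) for proj, h in zip(self.projections, intermediate_states))
        # The output keeps the original seq_len, so the image backbone sees no
        # additional conditioning tokens.
        return fused
```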
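TaBR evaluates a model by checking how well a real image survives a captioning-generation round trip. Below is a minimal sketch of that loop, with the captioner, generator, and similarity metric left as placeholder callables; the paper's concrete model and metric choices are not reproduced here.

```python
# Sketch of the Text-as-a-Bottleneck Reconstruction (TaBR) loop described in
# the abstract: caption a real image, regenerate an image from that caption
# alone, and score how well the original is reconstructed.
from typing import Callable, Iterable

def tabr_score(
    images: Iterable,          # real reference images
    caption_fn: Callable,      # image -> long structured caption (text)
    generate_fn: Callable,     # caption -> generated image
    similarity_fn: Callable,   # (real, generated) -> similarity score
) -> float:
    scores = []
    for real_image in images:
        caption = caption_fn(real_image)        # text is the only bottleneck
        reconstruction = generate_fn(caption)   # no access to the real pixels
        scores.append(similarity_fn(real_image, reconstruction))
    return sum(scores) / len(scores)
```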
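The released weights live at https://huggingface.co/briaai/FIBO. One way to fetch the repository with the standard huggingface_hub client is sketched below; inference usage is documented on the model card and is not shown here.

```python
# Download the FIBO weight repository locally using huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="briaai/FIBO")
print(f"FIBO weights downloaded to {local_dir}")
```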