NeoBabel: A Multilingual Open Tower for Visual Generation
July 8, 2025
Authors: Mohammad Mahdi Derakhshani, Dheeraj Varghese, Marzieh Fadaee, Cees G. M. Snoek
cs.AI
Abstract
Text-to-image generation advancements have been predominantly
English-centric, creating barriers for non-English speakers and perpetuating
digital inequities. While existing systems rely on translation pipelines, these
introduce semantic drift, computational overhead, and cultural misalignment. We
introduce NeoBabel, a novel multilingual image generation framework that sets a
new Pareto frontier in performance, efficiency, and inclusivity, supporting six
languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is
trained using a combination of large-scale multilingual pretraining and
high-resolution instruction tuning. To evaluate its capabilities, we expand two
English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG.
NeoBabel achieves state-of-the-art multilingual performance while retaining
strong English capability, scoring 0.75 on m-GenEval and 0.68 on m-DPG.
Notably, it performs on par with leading models on English tasks while
outperforming them by +0.11 and +0.09 on multilingual benchmarks, even though
these models are built on multilingual base LLMs. This demonstrates the
effectiveness of our targeted alignment training for preserving and extending
cross-lingual generalization. We further introduce two new metrics to rigorously
assess multilingual alignment and robustness to code-mixed prompts. Notably,
NeoBabel matches or exceeds English-only models while being 2-4x smaller. We
release an open toolkit, including all code, model checkpoints, a curated
dataset of 124M multilingual text-image pairs, and standardized multilingual
evaluation protocols, to advance inclusive AI research. Our work demonstrates
that multilingual capability is not a trade-off but a catalyst for improved
robustness, efficiency, and cultural fidelity in generative AI.